The most common symptom is a high volume of ICMP pings originating from the PowerScale cluster, typically from all nodes with frontend connectivity. In environments with network monitoring, this appears as a steadily increasing amount of outbound ping traffic. Where traffic is not monitored, the issue may surface during unrelated troubleshooting as a large number of running ping processes.

In rare cases, the accumulated ICMP traffic can place a sustained load on the network. Because the buildup is gradual, early symptoms may be mild, but over time they can lead to packet loss, degraded performance, and connection failures.

Packet loss is typically observed between the PowerScale cluster and clients, which can make the root cause difficult to identify. PowerScale may appear to be operating normally while traffic is dropped elsewhere in the network. Packet captures often show traffic leaving the cluster as expected, with loss occurring further along the network path. Network teams may also observe increasing Input Discards on switches, which indicates congestion. Because these symptoms have multiple possible causes, checking the cluster for long-running ping processes is recommended to rule out ICMP-related congestion.

The pings running from the cluster look like this:

    root 37924 0.0 0.0 16492 2372 - S 24Jun21 23:29.46 /sbin/ping -oD -S 192.168.1.20 -s 8500 192.168.2.50

Note the flags used and the 8500-byte size of the ping.

Counting the ping processes on every node shows many instances running across the cluster:

    MyCluster-5# isi_for_array 'pgrep -f /sbin/ping | wc -l'
    MyCluster-5: 280
    MyCluster-3: 280
    MyCluster-1: 281
    MyCluster-4: 274
    MyCluster-8: 279
    MyCluster-2: 298
    MyCluster-7: 280
    MyCluster-6: 285

In this case, more than 2,000 pings were running because they had been building up for almost 11 months.
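To get a single cluster-wide figure from the per-node counts, the output can be summed. The following is a minimal sketch, assuming the output lines have the "node: count" shape shown in this article; `sum_ping_counts` is a hypothetical helper name, not a OneFS command.

```shell
# Sum per-node counts from lines shaped like "MyCluster-5: 280".
# sum_ping_counts is a hypothetical helper, not part of OneFS.
sum_ping_counts() {
    # Split on the colon plus any padding that wc -l may add,
    # then add up the second field of every line.
    awk -F': *' 'NF >= 2 { total += $2 } END { print total + 0 }'
}

# Assumed usage on a cluster node, reusing the command from this article:
#   isi_for_array 'pgrep -f /sbin/ping | wc -l' | sum_ping_counts
```

Fed the eight per-node counts listed in this article, the helper prints 2257, which matches the "more than 2,000" figure.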
The issue originates from an older synciq_packet_fragmentation HealthCheck item that tests jumbo frame fragmentation for SyncIQ. The ping it runs looks like this:

    /sbin/ping -oD -S 192.168.1.20 -s 8500 192.168.2.50

The flags involved are:

    -o: Exit successfully after receiving one reply packet.
    -D: Set the Do Not Fragment bit.
    -S 192.168.1.20: Specifies the source IP to ping from (different on each node).
    -s 8500: Specifies the number of data bytes to send (8500 bytes).

The issue occurs because the -o flag is used and the command sets no count or timeout, so the ping exits only when a reply arrives. In some environments the 8500-byte ping never receives a reply, which means the process keeps running indefinitely. Each time the HealthCheck item runs, it spawns another ping, so running the HealthCheck on a schedule can leave a large number of ping processes running after an extended period.

This issue is rare because several conditions must all be true:

    The cluster is running an older HealthCheck framework (pre-2022 releases).
    The cluster is a SyncIQ source.
    The cluster uses MTU 9000 (jumbo frames).
    8500-byte pings with the DF (Do Not Fragment) bit set fail from the cluster to the SyncIQ target.
    The HealthCheck item "synciq_packet_fragmentation" is on a scheduled checklist.
    The cluster has a long uptime (the issue builds up over time).
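To check by hand whether the path to the SyncIQ target drops large unfragmented packets, the same ping can be run with a bound so that it cannot hang. The following is a minimal sketch with several assumptions: `-t 5` is the FreeBSD ping overall timeout in seconds and is assumed to be available in the OneFS ping, `probe_df_path` is a hypothetical helper name, and `PING_BIN` is a hypothetical override used only so the function can be exercised off-cluster.

```shell
# Bounded version of the HealthCheck ping: same -o, -D, -S and -s 8500
# flags as in this article, plus an assumed -t 5 overall timeout.
# probe_df_path and PING_BIN are hypothetical names for illustration.
probe_df_path() {
    src=$1   # source IP on this node
    dst=$2   # SyncIQ target IP
    if "${PING_BIN:-/sbin/ping}" -o -D -t 5 -S "$src" -s 8500 "$dst" >/dev/null 2>&1; then
        echo "pass"    # a reply arrived before the timeout
    else
        echo "fail"    # no reply: this path matches the failure condition
        return 1
    fi
}

# Assumed usage with the addresses from this article:
#   probe_df_path 192.168.1.20 192.168.2.50
```

A "fail" result only shows that the 8500-byte DF ping gets no reply from that source address; the other conditions listed above still have to hold for pings to accumulate.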
If the issue is present, immediate relief can be provided by stopping the running ping processes:

    isi_for_array -X 'killall ping'

This stops all running pings. Because the issue builds up slowly over long periods of time, there is no risk of immediate recurrence.

The issue is fixed in the latest release of the HealthCheck framework. OneFS 8.2.2.0 and later versions receive the fix through the latest HealthCheck framework patch (HealthCheck 32.0.6 Release and newer).

If upgrading the framework is not an option, this HealthCheck item can be disabled as a workaround. Contact Dell Support and reference this KB for assistance.
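Where other pings may legitimately be running on a node, it can help to list only the processes that match the HealthCheck signature before stopping anything. The following is a minimal sketch, assuming `ps` output in the column layout shown in this article (PID in the second column); `healthcheck_ping_pids` is a hypothetical helper name.

```shell
# Print the PIDs of pings that match the HealthCheck signature
# ("/sbin/ping -oD" with "-s 8500") from ps-style input on stdin.
# healthcheck_ping_pids is a hypothetical helper, not part of OneFS.
healthcheck_ping_pids() {
    awk 'index($0, "/sbin/ping -oD ") && index($0, " -s 8500 ") { print $2 }'
}

# Assumed usage on a single node; review the list before killing anything:
#   ps auxww | healthcheck_ping_pids
```

This only narrows the selection. The `killall ping` command above remains the documented way to clear the processes.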