
OPERATIONAL DEFECT DATABASE
...

...
On any HPE Mellanox SwitchX-based or SwitchX2-based Managed InfiniBand switch running Switch Management Software version 3.6.2002 (or earlier), the NAND flash can become fragmented over time due to excessive logging by the embedded Master Subnet Manager (SM). The "garbage collection" process eventually becomes overwhelmed by the excessive logging to the embedded file system.

This may result in symptoms such as high CPU utilization, random switch reboots, IPoIB connectivity issues, MLNX-OS update failure, or the switch becoming unusually slow to respond. On an affected device, the switch CLI/GUI response time may exceed 10 minutes and, in some cases, the CLI/GUI may stop responding so that the user cannot log into the switch.

The issue is more pronounced in some environments because certain applications cause the switch Subnet Manager (SM) to be more verbose (for example, certain multicast activity).
Any of the following switch systems running Switch Management Software version 3.6.2002 (or earlier):

Mellanox IB QDR/FDR 648P Switch Chassis (HPE Part Number: 674277-B21)
Mellanox IB QDR/FDR 324P Switch Chassis (HPE Part Number: 674278-B21)
Mellanox IB QDR/FDR 216P Switch Chassis (HPE Part Number: 674279-B21)
Mellanox IB QDR/FDR Modular Management Board (HPE Part Number: 674280-B21)
Mellanox IB QDR Modular Fabric Board (HPE Part Number: 674281-B21)
Mellanox IB FDR Modular Fabric Board (HPE Part Number: 674282-B21)
Mellanox IB QDR Modular Line Board (HPE Part Number: 674283-B21)
Mellanox IB FDR Modular Line Board (HPE Part Number: 674284-B21)
Mellanox InfiniBand FDR 36P Managed Switch (HPE Part Number: 670769-B21)
Mellanox InfiniBand FDR 36P RAF Managed Switch (HPE Part Number: 670770-B21)
HP 4X FDR InfiniBand Managed Switch for BladeSystem c-Class (HPE Part Number: 648311-B21)
Mellanox InfiniBand QDR/FDR10 36P Managed Switch (HPE Part Number: 712497-B21)
Mellanox InfiniBand QDR/FDR10 36P RAF Managed Switch (HPE Part Number: 712498-B21)
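The version boundaries stated in this advisory can be checked mechanically when auditing a fleet. The following is a minimal sketch, assuming version strings are purely numeric and dot-separated (as in "3.6.2002"); the function names are hypothetical and are not part of any HPE or Mellanox tool. Releases that fall between the two versions named in the advisory are reported as "unknown" because the advisory does not say how they behave.

```python
def parse_version(text):
    """Turn a string such as '3.6.2002' into (3, 6, 2002) for numeric comparison.

    Assumption: the string contains only dot-separated integers; anything
    else raises ValueError.
    """
    return tuple(int(part) for part in text.strip().split("."))


LAST_AFFECTED = parse_version("3.6.2002")  # "3.6.2002 (or earlier)" per the advisory
FIRST_FIXED = parse_version("3.6.3004")    # "3.6.3004 (or later)" per the advisory


def classify(version_text):
    """Return 'affected', 'fixed', or 'unknown' for a Switch Management Software version."""
    version = parse_version(version_text)
    if version <= LAST_AFFECTED:
        return "affected"
    if version >= FIRST_FIXED:
        return "fixed"
    # The advisory does not describe releases between the two boundaries.
    return "unknown"


if __name__ == "__main__":
    for candidate in ("3.5.1002", "3.6.2002", "3.6.3004"):
        print(candidate, classify(candidate))
```

Comparing tuples of integers avoids the pitfalls of comparing the version strings directly, where "3.10.0" would sort before "3.6.3004".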
To correct the issue, download and install Switch Management Software for Mellanox Infiniband version 3.6.3004 (or later). Switches running older Switch Management Software versions for Mellanox should be updated to version 3.6.3004.

To download Switch Management Software for Mellanox Infiniband version 3.6.3004 (or later), perform the following steps:

1. Click the following link: http://h20566.www2.hpe.com/portal/site/hpsc
2. Enter a product name (e.g., "Mellanox InfiniBand FDR 36P Managed Switch" or "670769-B21") in the text field under Enter a Product Name or Number.
3. Click Go.
4. Select the appropriate product model from the Results list (if prompted).
5. Click the "drivers, software & firmware" hyperlink under the Download Options tab.
6. Select the system's specific operating system from the Operating Systems dropdown menu.
7. Click the category Firmware - Network.
8. Select the latest release of Switch Management Software for Mellanox Infiniband version 3.6.3004 (or later).
9. Click Download.

If the switch or device is affected by any of the symptoms mentioned in the Description section, and it is not possible to perform the firmware update, perform the following steps as a workaround:

1. If the switch CLI is functional, free up space on the switch file system and reduce Subnet Manager logging as explained below.

   Delete unused image files on the switch using sftp. The images are located in the "/var/opt/tms/images/" directory.

   Example, from a Linux box (where 10.7.54.243 is the switch management IP address):

      # sftp admin@10.7.54.243
      Password:
      Connected to 10.7.54.243.
      sftp> cd /var/opt/tms/images/
      sftp> ls
      image-PPC_M460EX-3.5.1000.img image-PPC_M460EX-3.5.1002.img
      sftp> rm image-PPC_M460EX-3.5.1000.img
      sftp> rm image-PPC_M460EX-3.5.1002.img

   If the CLI is accessible, log into the switch as "admin" and run the following commands to reduce the SM logging threshold. The setting still lets the SM continue to log any "genuine" error messages:

      # ssh admin@10.7.54.243
      switch > enable
      switch # configure terminal
      switch (config) # ib sm log-flags error

   In addition, run the following command to reduce the "smm" logging level:

      switch (config) # logging local override class mlx-daemons priority err

   Once done, monitor the switch CPU utilization and response time for the next 24-48 hours. Continue using the switch if it recovers. Update to Switch Management Software version 3.6.3004 (or later) at the next available opportunity.

2. If the switch does not recover, or if the command line interface is too slow to run any of these commands, call HPE Support to open a support ticket for switch replacement under HPE warranty. Once the new switch is in production, immediately run the log reduction commands as explained in Step 1, and update to Switch Management Software version 3.6.3004 (or later) at the next available opportunity.

Click on the following URL to locate the HPE Customer Support phone number in your country: https://h20195.www2.hpe.com/v2/Getdocument.aspx?docname=A00039121ENW

RECEIVE PROACTIVE UPDATES: Receive support alerts (such as Customer Advisories), as well as updates on drivers, software, firmware, and customer replaceable components, proactively via e-mail through HPE Subscriber's Choice. Sign up for Subscriber's Choice at the following URL: Proactive Updates Subscription Form.

NAVIGATION TIP: For hints on navigating HPE.com to locate the latest drivers, patches, and other support software downloads for ProLiant servers and Options, refer to the Navigation Tips document.

SEARCH TIP: For hints on locating similar documents on HPE.com, refer to the Search Tips document.
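For the image cleanup part of the workaround described above, the following is a minimal sketch of how the list of sftp "rm" commands might be prepared from a directory listing. It assumes image files follow the naming seen in the advisory's example (image-<platform>-<version>.img) and that the operator supplies the set of versions to keep; the advisory does not define which images count as "unused", so `keep_versions` and the function name are hypothetical. The resulting lines can be typed at the sftp> prompt; feeding them through sftp batch mode (`sftp -b`) is also possible in principle but requires non-interactive authentication, which may not be available on the switch.

```python
import re

IMAGE_DIR = "/var/opt/tms/images/"  # image directory named in the advisory

# Assumed naming scheme, based only on the advisory's example file names,
# e.g. image-PPC_M460EX-3.5.1000.img
IMAGE_PATTERN = re.compile(r"^image-.+-(\d+(?:\.\d+)+)\.img$")


def removal_commands(listing, keep_versions):
    """Return sftp commands that remove images whose version is not in keep_versions.

    listing       -- file names as shown by 'ls' in the image directory
    keep_versions -- version strings the operator wants to keep (hypothetical input)
    """
    commands = ["cd " + IMAGE_DIR]
    for name in listing:
        match = IMAGE_PATTERN.match(name)
        # Files that do not look like images are left alone.
        if match and match.group(1) not in keep_versions:
            commands.append("rm " + name)
    return commands


if __name__ == "__main__":
    files = ["image-PPC_M460EX-3.5.1000.img", "image-PPC_M460EX-3.5.1002.img"]
    for line in removal_commands(files, keep_versions={"3.6.2002"}):
        print(line)
```

Reviewing the generated commands before running them is advisable, since removing the wrong image file cannot be undone from the switch.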
Operating Systems Affected: Not Applicable