...
The current Isilon On-Cluster Analysis tool (IOCA) script generates the following warnings related to the /var partitions:

System Partition Free Space FAIL
CRITICAL: The following nodes have /var mirrors of different sizes: 1-10
INFO: Please reference KB 000213248 (https://www.dell.com/support/kbdoc/000213248) for further information.
INFO: For more information refer to KB article 000041465 found at https://www.dell.com/support/kbdoc/000041465.

Or:

Mirror Status FAIL
CRITICAL: The mirror pair for var1 appear to be in the same fault domain on nodes: 1-10.
CRITICAL: The following nodes have /var mirrors of different sizes: 1-10
INFO: Please open a Technical Support Service Request and reference this failure within the description.
INFO: Please refer to KB 000213248 (https://www.dell.com/support/kbdoc/en-us/000213248) for further information.

If ignored, it is possible that a panic may occur on a Gen6 or Gen6 MLK node during a drive or sled replacement procedure. When the node comes up, reviewing the /var/log/messages file for the node shows errors and panic messages similar to the following:

(da21:pmspcbsd0:0:22:0): pccb 0xfffffe8543174480, ccb 0xfffff80e31347000: ccbStatus 3, scsiStatus 5
(da22:pmspcbsd0:0:23:0): pccb 0xfffffe8543151fe0, ccb 0xfffff807c4753000: ccbStatus 3, scsiStatus 5
(da21:pmspcbsd0:0:22:0): WRITE(10). CDB: 2a 00 00 04 dd 44 00 00 04 00
(da21:pmspcbsd0:0:22:0): CAM status: CCB request aborted by the host
(da22:pmspcbsd0:0:23:0): WRITE(10). CDB: 2a 00 00 04 dd 44 00 00 04 00
(da21:pmspcbsd0:0:22:0): Retrying command, 3 more tries remain
(da22:pmspcbsd0:0:23:0): CAM status: CCB request aborted by the host
(da21:pmspcbsd0:0:22:0): pccb 0xfffffe8543151fe0, ccb 0xfffff80e31347000: ccbStatus 3, scsiStatus 2
(da22:pmspcbsd0:0:23:0): Retrying command, 3 more tries remain
(da21:pmspcbsd0:0:22:0): cam_periph_error: SSQ_LOST removing device ccb 0xfffff80e31347000 status 0x8 flags 0x2
(da22:pmspcbsd0:0:23:0): pccb 0xfffffe8543151fe0, ccb 0xfffff807c4753000: ccbStatus 3, scsiStatus 2
(da21:pmspcbsd0:0:22:0): Invalidating pack
(da22:pmspcbsd0:0:23:0): cam_periph_error: SSQ_LOST removing device ccb 0xfffff807c4753000 status 0x8 flags 0x2
(da22:pmspcbsd0:0:23:0): Invalidating pack
(da21:pmspcbsd0:0:22:0): removing device entry
(da21:pmspcbsd0:0:22:0): Periph destroyed
panic @ time 1681142660.493, thread 0xfffffe874ee12000: mirror/var1: all devices failed (read, offset 1304707072, length 0)
time = 1681142660
cpuid = 3, TSC = 0x5e76342b8e8e59
Panic occurred in module kernel loaded at 0xffffffff80200000:

Stack:
--------------------------------------------------
kernel:g_mirror_worker+0x251f
kernel:fork_exit+0x82
--------------------------------------------------
Disabling swatchdog
Dumping stacks (40960 bytes)
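If a node has already panicked and rebooted, a quick way to look for this signature is to search that node's message log for the gmirror failure strings. The following is only a minimal sketch, not part of this KB procedure; the patterns are taken from the example output above and may need adjusting:

# Search the node's message log for the gmirror /var failure signature.
# Run on the suspect node; patterns follow the example messages above.
egrep 'mirror/var[01]: all devices failed|cam_periph_error: SSQ_LOST|Invalidating pack' /var/log/messages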
In versions 9.2.1.16 and later, 9.4.0.6 and later, and 9.5.0.0 and later, OneFS expands the /var partition size to 2 GB during an upgrade. The expansion process may cause these messages or events to be triggered.

The message related to the mirror pair being in the same fault domain is specific to Gen6 and Gen6 MLK nodes (A200, A2000, A300, A3000, H400, H500, H600, H5600, H700, H7000, F800, and F810). The expansion may not validate partition fault domains properly, putting both partitions for either /var mirror on drives in the same drive sled. This causes the node to panic when the sled is removed if both drives backing the mounted /var mirror are in that sled.

Either mirror/var0 or mirror/var1 serves as the active /var partition mirror at any time. From the panic message in the example, we see that mirror/var1 was the active mirror. Looking at the gmirror status and isi devices drive list command outputs for the node:

Truncated status command output:

gmirror status
mirror/var1  COMPLETE  da14p3 (ACTIVE) <<<<
                       da13p3 (ACTIVE)

The drive list command output:

isi devices drive list
Lnn Location Device    Lnum State   Serial       Sled
---------------------------------------------------------
21  Bay 1    /dev/da1  15   L3      xxxxxxxxxxxx N/A
21  Bay 2    /dev/da2  16   L3      xxxxxxxxxxxx N/A
21  Bay A0   /dev/da5  12   HEALTHY xxxxxxxx     A
21  Bay A1   /dev/da4  13   HEALTHY xxxxxxxx     A
21  Bay A2   /dev/da3  14   HEALTHY xxxxxxxx     A
21  Bay B0   /dev/da8  9    HEALTHY xxxxxxxx     B
21  Bay B1   /dev/da7  10   HEALTHY xxxxxxxx     B
21  Bay B2   /dev/da6  11   HEALTHY xxxxxxxx     B
21  Bay C0   /dev/da11 6    HEALTHY xxxxxxxx     C
21  Bay C1   /dev/da10 7    HEALTHY xxxxxxxx     C
21  Bay C2   /dev/da9  8    HEALTHY xxxxxxxx     C
21  Bay D0   /dev/da14 3    HEALTHY xxxxxxxx     D <<<<
21  Bay D1   /dev/da13 4    HEALTHY xxxxxxxx     D <<<<
21  Bay D2   /dev/da12 5    HEALTHY xxxxxxxx     D
21  Bay E0   /dev/da17 0    HEALTHY xxxxxxxx     E
21  Bay E1   /dev/da16 1    HEALTHY xxxxxxxx     E
21  Bay E2   /dev/da15 2    HEALTHY xxxxxxxx     E
---------------------------------------------------------

In this example, mirror/var1 is built on drives D0 and D1. When the D sled was removed from the cluster, the node panicked because it could no longer access the /var file system.

The different sizes of the /var partitions can affect any Isilon or PowerScale node type that runs OneFS 9.x. The expansion for /var only expands the active /var partition on the system. The Last Known Good (LKG) partition remains the original size. If the active /var partition is expanded and filled to greater than 50%, this may cause issues if the partition must be rotated for maintenance.

To determine if this issue affects an LKG partition on the cluster, use the following command:

# isi_for_array -sX 'gmirror list var0 var1' | grep -A20 mirror | egrep "var|Media"

Example:

lab-1# isi_for_array -sX 'gmirror list var0 var1' | grep -A20 mirror | egrep "var|Media"
lab-1: 1. Name: mirror/var0
lab-1: Mediasize: 2147479552 (2.0G)
lab-1: Mediasize: 2147483648 (2.0G)
lab-1: Mediasize: 2147483648 (2.0G)
lab-2: 1. Name: mirror/var0
lab-2: Mediasize: 2147479552 (1.0G) <<<<
lab-2: Mediasize: 2147483648 (1.0G) <<<<
lab-2: Mediasize: 2147483648 (1.0G) <<<<
lab-3: 1. Name: mirror/var0
lab-3: Mediasize: 2147479552 (2.0G)
lab-3: Mediasize: 2147483648 (2.0G)
lab-3: Mediasize: 2147483648 (2.0G)
lab-4: 1. Name: mirror/var0
lab-4: Mediasize: 2147479552 (2.0G)
lab-4: Mediasize: 2147483648 (2.0G)
lab-4: Mediasize: 2147483648 (2.0G)
....

If any of the devices come back with a size of (1.0G), that node is affected. In the above example, lab-2's var0 partitions are affected and must be fixed.
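To sweep the whole cluster in one pass, the same check can be reduced to a short filter that prints only the nodes reporting a 1.0G component. This is a minimal sketch rather than part of the KB procedure; it assumes the node-name prefix format shown in the example output above:

# Flag nodes whose var0 or var1 mirror components still report a 1.0G Mediasize.
# The awk field split assumes isi_for_array prefixes each output line with "<node>:".
isi_for_array -sX 'gmirror list var0 var1' \
  | grep -A20 mirror \
  | egrep "var|Media" \
  | awk -F: '/\(1\.0G\)/ { print $1 }' \
  | sort -u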
Rerun the command for var1 across the cluster to determine if it is also affected.
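For the fault-domain warning on Gen6 and Gen6 MLK nodes, the diagnosis shown in the example above can be repeated on any node by cross-referencing the /var mirror components with their sled placement. A rough sketch, assuming the device naming (daNN) and sled column shown in the example output; the egrep pattern is illustrative and should list the devices returned by the first command:

# List the drives backing the /var mirrors on this node.
gmirror status var0 var1 | grep -o 'da[0-9]*' | sort -u
# Then check the sled letter for each of those devices; both components of one
# mirror sitting in the same sled indicates the fault-domain problem.
isi devices drive list | egrep 'da13|da14'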
The permanent fix for the FAULT DOMAIN ISSUE ONLY is in the following code releases:

OneFS 9.5.0.6
OneFS 9.4.0.16
OneFS 9.2.1.25

The permanent fix for the different size /var partitions is being worked on. To resolve the issue, or if you are unable to upgrade, follow the scripted process below.

Note: This issue cannot be resolved manually or by using the script below on clusters running in compliance mode. If the cluster is in SmartLock Compliance mode, this issue can be remediated by upgrading to the OneFS versions listed above. If unable to upgrade to remediate this issue, contact Dell Support and request a DA patch.

There is a script available to address this issue. Contact Dell Support to receive the files, and then follow the instructions below.

To use the script:

1. Download the script and the md5 file to the cluster.

2. Copy the files to /ifs/data/Isilon_Support on the cluster and confirm that the md5 hash matches the hash in the md5 file:

Lab-1# mv var_mirror_repair.sh /ifs/data/Isilon_Support/
Lab-1# mv var_mirror_repair.md5 /ifs/data/Isilon_Support/
Lab-1# md5 /ifs/data/Isilon_Support/var_mirror_repair.sh
MD5 (/ifs/data/Isilon_Support/var_mirror_repair.sh) = 0881afeeb39fdaf02e2a90d784e4ed21
Lab-1# cat /ifs/data/Isilon_Support/var_mirror_repair.md5
0881afeeb39fdaf02e2a90d784e4ed21

If the hash does not match, download the script from the FTP site and copy it to the cluster again.

3. If the hash matches, run the following command as root:

lab-1# sh /ifs/data/Isilon_Support/var_mirror_repair.sh

The script usually takes 5-10 minutes to run. It can take longer on large (30+ nodes) or busy clusters. When the script completes, it reports whether it was successful or if there were issues.

When you launch the script, you see the following output:

Lab-1# sh /ifs/data/Isilon_Support/var_mirror_repair.sh
Full output can be found at: /ifs/data/Isilon_Support/var_mirror_repair.FULL_CLUSTER.2023-10-19T092522.csv
Status: 0/4 Nodes checked, 0/4 var0 partitions, 0/4 var1 partitions

As the script progresses, the Status line updates:

Status: 4/4 Nodes checked, 4/4 var0 partitions, 4/4 var1 partitions

When the script completes successfully, you see the following:

Status: 4/4 Nodes checked, 4/4 var0 partitions, 4/4 var1 partitions
No issues were identified.
Moving files to: var_mirror_repair.2023-10-19T092522.d
Bundle Location: var_mirror_repair.2023-10-19T092522.tgz

This indicates that any impacted nodes were repaired and that the script ran without issues. If the output received is different from this example, contact Dell Support and provide the output and the log files from the bundle location.
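As a convenience, the hash check and launch in steps 2 and 3 can be wrapped in a few lines of shell. This is only a sketch using the file names and /ifs/data/Isilon_Support location from the example above; it is not part of the supplied script, and the steps can equally be run by hand as shown:

# Compare the computed MD5 against the value in the .md5 file, then run the
# repair script only if they match. Paths and file names follow the example above.
cd /ifs/data/Isilon_Support || exit 1
calc=$(md5 -q var_mirror_repair.sh)   # -q prints only the digest
want=$(cat var_mirror_repair.md5)
if [ "$calc" = "$want" ]; then
    sh ./var_mirror_repair.sh
else
    echo "MD5 mismatch: re-download var_mirror_repair.sh" >&2
fi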