...
In certain workflow environments, SyncIQ policies may fail because of their policy configuration when replication uses two or more network pools. Use the steps below to determine whether the SyncIQ policy configuration meets the requirements for this issue.

Sync jobs may fail with the error:

"SyncIQ is unable to connect to a resource on the target cluster. Use ping to determine if the target is reachable. Verify that SyncIQ is licensed and enabled on the target cluster, and that the SyncIQ daemons are running. No live connections for over 15 minutes"

Two different scenarios covered in this KB can lead to the failure above. In both scenarios, different SyncIQ policies must be restricted to different network pools in the cluster environment.

To verify which pools the SyncIQ policies are configured to use (a live-cluster alternative using the standard CLI is sketched at the end of this step):

FROM LOGS:
# cat local/isi_sync_policy | egrep "ID:|Name:|Subnet:|Pool:"

Example Output:
ID: 57a93ec418a1d9f165c61313213bd160
Name: Policy_1
Source Subnet: Subnet0
Source Pool: replication-pool
ID: 7462e4a77fc633ea59eaebecce59ffb0
Name: Policy_2
Source Subnet: Subnet0
Source Pool: replication-pool
ID: 6c8e3153be6f8024ba2fa07e81bf9958
Name: Policy_3
Source Subnet: Subnet0
Source Pool: synciq-pool
ID: 46cad1a91a3b34a0257c0b1051dff201
Name: Policy_4
Source Subnet: Subnet0
Source Pool: synciq-pool

FROM LIVE:
Scorpion-1# cat /ifs/.ifsvar/modules/tsm/config/siq-policies.gc | egrep "common.name|restrict_by"

Example Output:
[rodrir15@elvis 2024-02-15-001]$ cat local/ifsvar_modules.tar/modules/tsm/config/siq-policies.gc | egrep "common.name|restrict_by"
policy.0.common.name {token:67} = "Policy_1"
policy.0.target.restrict_by {token:52} = "subnet0:replication-pool"
policy.1.common.name {token:67} = "Policy_2"
policy.1.target.restrict_by {token:52} = "subnet0:replication-pool"
policy.2.common.name {token:67} = "Policy_3"
policy.2.target.restrict_by {token:52} = "subnet0:synciq-pool"
policy.3.common.name {token:67} = "Policy_4"
policy.3.target.restrict_by {token:52} = "subnet0:synciq-pool"

Notice that Policies 1 and 2 use the "replication-pool" pool while Policies 3 and 4 use the "synciq-pool" pool.

Once you have confirmed that the SyncIQ policies meet these requirements, verify which of the following scenarios is causing the failures.
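On a live cluster, an alternative to reading siq-policies.gc directly is to pull the same fields from the SyncIQ policy listing. This is a minimal sketch; it assumes that, on your OneFS release, the verbose policy listing prints the ID, Name, Source Subnet, and Source Pool fields shown in the log example above:

# isi sync policies list -v | egrep "ID:|Name:|Subnet:|Pool:"

Policies that show different Source Subnet / Source Pool values on the same cluster meet the condition described above.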
Scenario 1: The pools used for replication contain different hardware types with different worker limits.

How to verify the worker limits for the nodes in the cluster:

FROM LOGS:
# echo -e "Worker\t # of workers" ; ls local/ifsvar_modules.tar/modules/tsm/config/worker_pool/ | grep -v lock | while read worker ; do echo -en $worker "\t" ; cat local/ifsvar_modules.tar/modules/tsm/config/worker_pool/$worker; echo "" ; done

FROM LIVE:
# echo -e "Worker\t # of workers" ; ls /ifs/.ifsvar/modules/tsm/config/worker_pool/ | grep -v lock | while read worker ; do echo -en $worker "\t" ; cat /ifs/.ifsvar/modules/tsm/config/worker_pool/$worker; echo "" ; done

Example Output:
Worker                     # of workers
1                          24
2                          24
3                          24
4                          24
5                          80
6                          80
total_rationed_workers     256

Notice that nodes 1-4 have a limit of 24 workers per node, while nodes 5 and 6 have a limit of 80 workers per node.

Scenario 2: Each pool is configured with the same hardware type, but the pools contain different numbers of nodes.

Example Output:
Worker                     # of workers
1                          24
2                          24
3                          24
4                          24
5                          24
6                          24
total_rationed_workers     144

Verify how many nodes each pool uses for replication:

FROM LOGS:
# cat local/isi_network_pools

FROM LIVE:
# isi network pools list -v

Example Output:
Groupnet: groupnet0
Subnet: Subnet0
Name: replication-pool
Rules: -
Access Zone: System
Allocation Method: static
Aggregation Mode: lacp
SC Suspended Nodes: -
Description:
Ifaces: 1:25gige-agg-1, 2:25gige-agg-1, 3:25gige-agg-1, 4:25gige-agg-1
IP Ranges: 172.xx.xxx.x - 172.xx.xxx.x, 172.xx.xxx.x - 172.xx.xxx.x

Groupnet: groupnet0
Subnet: Subnet0
Name: synciq-pool
Rules: -
Access Zone: System
Allocation Method: static
Aggregation Mode: lacp
SC Suspended Nodes: -
Description:
Ifaces: 5:25gige-agg-1, 6:25gige-agg-1
IP Ranges: 172.xx.xxx.x - 172.xx.xxx.x

Notice that the "replication-pool" uses nodes 1 through 4, while the "synciq-pool" uses nodes 5 and 6. To compare the total worker capacity of each pool directly, see the sketch below.
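To put a number on the gap between the two pools, the per-node worker limits can be totaled per pool. The one-liner below is a sketch only: it assumes the worker_pool files are named by node number (as in the example output above), and the node lists ("1 2 3 4" and "5 6") are taken from that example; substitute the nodes that belong to your own pools.

FROM LIVE:
# for pool in "replication-pool:1 2 3 4" "synciq-pool:5 6"; do name=${pool%%:*}; total=0; for n in ${pool#*:}; do total=$((total + $(cat /ifs/.ifsvar/modules/tsm/config/worker_pool/$n 2>/dev/null || echo 0))); done; echo "$name: $total workers"; done

With the Scenario 1 limits above, this prints 96 workers for replication-pool (4 nodes x 24) and 160 workers for synciq-pool (2 nodes x 80); with the Scenario 2 limits it prints 96 and 48. A large difference in capacity between pools is what sets up the failure described next.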
Once you have confirmed the cause, proceed to the solution below.

The SyncIQ bandwidth daemon looks at ALL participating nodes to determine how many workers a job can run, and it rations workers evenly based on that total and on the number of jobs in flight. If SyncIQ jobs run concurrently in different node pools that support differing numbers of workers, jobs can exhaust all of the workers in one pool, starving new jobs of workers and leading to these job failures.
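As a rough illustration of why the mismatch matters, assume (purely for arithmetic; the daemon's internal accounting is more involved than this) that the ration is simply the cluster-wide total divided by the number of concurrent jobs. Using the Scenario 1 numbers:

# total=256; jobs=2; echo "per-job ration: $((total / jobs)) workers; replication-pool capacity: $((4 * 24)) workers; synciq-pool capacity: $((2 * 80)) workers"

A ration of 128 workers cannot be supplied by the replication-pool, which only has 96 workers across nodes 1-4, which illustrates how even rationing over the cluster-wide total can overcommit the smaller pool when jobs restricted to different pools run at the same time.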
As a workaround, disable worker pooling. See KB article 21078, Isilon SyncIQ: How to disable worker pools and revert worker pooling to OneFS 7.x behavior.

Another option is to reschedule the jobs that use different pools so that their run times do not overlap (see the sketch below). However, this may not be possible in every environment, and it does not guarantee that the jobs avoid these failures.
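If you choose the rescheduling option, policy schedules can be staggered from the CLI. The commands below are a sketch only: the policy names come from the example above, the times are arbitrary, and the exact schedule string syntax should be confirmed against the OneFS CLI reference for your release.

# isi sync policies modify Policy_1 --schedule "every day at 10:00 PM"
# isi sync policies modify Policy_3 --schedule "every day at 2:00 AM"

Staggering the policies that are restricted to different pools keeps those jobs from running concurrently, which is the condition that triggers the failures described above.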