...
RecoverPoint for Virtual Machines (RP4VMs) requires constant access to both its Repository LUNs and the corresponding journal LUNs that are created for each Consistency Group (CG). The vRPAs access these volumes through the JIRAF module (JAM) running on the ESXi host they reside on. An issue can occur where intermittent access to the Repository and Journal LUNs causes Data Replication Unavailability (DRU) and leaves virtual machines unprotected during this time.

The following sequence is seen in both the Control and Replication logs of the site control virtual RecoverPoint Appliance (vRPA) when access to these LUNs is impacted: I/Os to the JIRAF module (running on each ESXi host that the vRPAs reside on) time out on the vRPA side, followed by the vRPA itself timing out while trying to read responses from the JIRAF module.

From the control logs:

2019/08/18 05:34:51.318 - #1 - 3960/3909 - SocketInfoJIRAF::isFDReady: poll timeout errno = 0 a_expireTimeUsecs = 852155604014 ( m_lr=(0xXXXXXXXXXX,0xXXXXXXXXXe_JIRAF) m_handle=0 m_openCount=1 m_status=e_OK m_cidPort = 2:5050 m_afVMCI = 40 m_sockFD = 149)

2019/08/18 05:36:53.933 - #1 - 3953/3909 - SocketInfoJIRAF::isFDReady: poll timeout errno = 0 a_expireTimeUsecs = 852278255722 ( m_lr=(0xXXXXXXXXXX,0xXXXXXXXXXe_JIRAF) m_handle=0 m_openCount=1 m_status=e_OK m_cidPort = 2:5050 m_afVMCI = 40 m_sockFD = 149)

Partial messages are sent to the JIRAF module on the ESXi host, and the vRPA fails when trying to send more data to this module.

From the control logs:

2019/08/15 01:02:43.410 - #1 - 4773/4734 - SocketInfoJIRAF::sendData: send byte count mismatch( m_lr=(0xXXXXXXXXXX,0xXXXXXXXXXe_JIRAF) m_handle=0 m_openCount=6 m_status=e_OK m_cidPort = 2:5050 m_afVMCI = 40 m_sockFD = 60) bytes_sent = 261883 a_num_bytes = 1048576

2019/08/15 04:29:42.574 - #1 - 9455/9402 - SocketInfoJIRAF::sendData: send byte count mismatch( m_lr=(0xXXXXXXXXXX,0xXXXXXXXXXe_JIRAF) m_handle=0 m_openCount=6 m_status=e_OK m_cidPort = 2:5050 m_afVMCI = 40 m_sockFD = 64) bytes_sent = 19759 a_num_bytes = 225792

The following is seen in the JIRAF logs, located on each ESXi host under /scratch/log/iofilterd-emcjiraf.log, showing the JIRAF module reading partial messages:

2019-08-03T02:03:59Z iofilterd-emcjiraf[2308573]: jiraf_receive_msg: unknown cmd type

The following is also seen in the ESXi splitter logs, located under /scratch/log/kdriver.log.xxxxxxxx:

2019-07-30T17:36:32Z iofilterd-emcjiraf[2099635]: IoStats_s_printStats: total 2 IOs over 90 seconds. average time to start 6us, pending 555us, processing 18us

This "IOs over X seconds" message should be printed every 60 seconds. If the interval reported here is not 60 seconds, the issue is being encountered.
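As an illustration only, and not a Dell-provided tool, a small script along the following lines could scan a copy of the splitter log for IoStats_s_printStats entries and flag any reporting interval other than the expected 60 seconds. The script name, and the regular expression derived from the sample message above, are assumptions.

    import re
    import sys

    # Matches the interval in lines such as:
    # "IoStats_s_printStats: total 2 IOs over 90 seconds. average time to start 6us, ..."
    PATTERN = re.compile(r"IoStats_s_printStats: total \d+ IOs over (\d+) seconds")

    def check_log(path):
        """Return the log lines whose stats interval is not 60 seconds."""
        suspect = []
        with open(path, "r", errors="replace") as log:
            for line in log:
                match = PATTERN.search(line)
                if match and int(match.group(1)) != 60:
                    suspect.append(line.rstrip())
        return suspect

    if __name__ == "__main__":
        # Usage (hypothetical): python check_jiraf_interval.py kdriver.log.00000000
        hits = check_log(sys.argv[1])
        if hits:
            print("Stats interval deviates from 60 seconds; the issue may be present:")
            for entry in hits:
                print("  " + entry)
        else:
            print("All IoStats_s_printStats entries report the expected 60-second interval.")

Running this against each /scratch/log/kdriver.log.xxxxxxxx file copied off the ESXi hosts gives a quick way to confirm whether the symptom described above is occurring.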
The JAM (emcjiraf) module is responsible for maintaining repository and journal access. As part of its normal operation, it performs an RPVS discovery process to keep track of which datastores, storage, and VMDKs are available to it. Network operations run in tandem with this ongoing discovery process. If a discovery pass takes too long to complete, the network operations may not run frequently enough, resulting in the loss of access to the repository volume, the journal volumes, or both. A conceptual sketch of this starvation pattern follows.
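The sketch below is purely illustrative and is not the actual JAM implementation; all names and durations are hypothetical. It only shows how a long-running discovery pass that shares a loop with periodic network work can delay that work past its deadline.

    import time

    DISCOVERY_DURATION = 90   # hypothetical: seconds a slow RPVS discovery pass takes
    KEEPALIVE_INTERVAL = 60   # hypothetical: how often the network operations must run

    def run_discovery(duration):
        # Stands in for a discovery pass that blocks until it finishes.
        time.sleep(duration)

    def run_network_operations():
        print("network operations ran at", time.strftime("%H:%M:%S"))

    def main_loop(cycles=3):
        for _ in range(cycles):
            run_discovery(DISCOVERY_DURATION)
            # Because discovery blocks for longer than KEEPALIVE_INTERVAL,
            # the network operations run late each cycle; in the real module
            # this kind of delay is what can cost access to the repository
            # and journal volumes.
            run_network_operations()

    if __name__ == "__main__":
        main_loop()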
This issue is addressed in RecoverPoint for Virtual Machines version 5.2.2.1 and later.