
This solution is intended to provide additional information about the use and configuration of the Network Time Protocol (NTP) used by Avamar grids. It also provides further details and troubleshooting steps for problems arising during NTP configuration using the asktime utility. It covers the steps required to configure NTP for node additions or extra servers (such as Network Data Management Protocol (NDMP) accelerators).
Avamar uses NTP to maintain time synchronization against an external time source and across all Avamar Data Store (ADS) nodes. The ADS software bundle contains a utility, asktime, which is used to configure NTP. It can be run as part of the setup process, or manually if required. The purpose of this article is to provide additional information and tips for troubleshooting asktime and NTP-related issues.
Notes:
- All commands listed in this article should be run from the Avamar Utility Node, as admin, with ssh keys loaded, unless stated otherwise. For more information about keys, see Avamar: How to Log in to an Avamar Server and Load Various Keys.
- Some output in this article is deliberately trimmed for brevity, in particular the output of repetitive mapall commands.
- Any names and IP addresses referenced are examples only. Replace them with customer-specific details.

Section #1 - Basic NTP functionality:
1. The basic schematic of NTP functionality within Avamar is as follows:
   - All nodes (utility and storage) should poll one or more user or public NTP servers.
   - All storage (data) nodes should poll the same user servers. In addition, the storage nodes poll the utility node (0.s) and the first storage node (0.0).
   - The utility node runs in local time. All storage nodes run in UTC.
2. The intention is that all nodes can maintain time independently. (If an external NTP server is unavailable, the nodes can still maintain synchronization by using 0.s and 0.0.)

Section #2 - Additional UNIX utilities for checking NTP functionality:
1. Additional UNIX utilities can be used to verify the status of the Network Time Protocol daemon (NTPD), which runs on each node to maintain the system time:
   - service ntpd status/stop/start - verify that NTPD is running; stop and start it as needed.
   - date - show the current system date and time.
   - ntpdate - poll a remote NTP server and, if required, set the local system clock.
   - ntpq - view current NTP connectivity.
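As an illustration of how the ntpdate utility listed above can be scripted, the following is a minimal sketch that classifies ntpdate output read from standard input. The function name classify_ntpdate is hypothetical and not part of Avamar; the sketch assumes only the two standard ntpdate messages ("adjust/step time server ... offset ... sec" and "no server suitable for synchronization found").

```shell
#!/bin/sh
# Hypothetical helper (not an Avamar tool): classify ntpdate output on stdin.
# Prints "OK <offset>" when a server answered, "UNREACHABLE" when ntpdate
# reported that no server was suitable, and "UNKNOWN" otherwise.
classify_ntpdate() {
    awk '
        /no server suitable for synchronization found/ { bad = 1 }
        /(adjust|step) time server/ {
            # The offset value is the field that follows the word "offset".
            for (i = 1; i < NF; i++)
                if ($i == "offset") { off = $(i + 1); ok = 1 }
        }
        END {
            if (ok)       print "OK", off
            else if (bad) print "UNREACHABLE"
            else          print "UNKNOWN"
        }
    '
}
```

A possible use is "ntpdate -d 168.xxx.xx.x 2>&1 | classify_ntpdate", which queries the server in debug mode without setting the clock and reduces the output to a single line.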
Section #3 - Additional Avamar utilities for checking NTP functionality:
Use the "check.dpn --preinstall --checktime" script to automate checking NTPD across all or specified nodes, verifying that each node is running NTPD properly and has a time server selected. This functionality is also used by both the installer and the GSAN at startup, and as such is a key indicator that NTPD is working as required.
These commands, especially when used with mapall, should be sufficient to debug most NTP-related errors.

Section #4 - Troubleshooting NTP issues:
1. NTP issues typically occur immediately when performing new installs, as a result of a bad timeserver address provided by the customer, or due to firewall or routing issues.
2. NTP issues also arise when, after working properly for some time, changes are made to the customer network (an NTP server is removed, and so on). Over time, this begins to affect the Avamar grid.
3. To minimize the risk of issues during the install, use the ntpdate program with the debug option (-d) to verify that one or more assigned time servers are available and servicing requests. See APPENDIX A for complete sample output. In the example below, the handshake between the Avamar node and the time server can be seen. In this instance, the time server reports a small offset of 0.000006 sec:

    ntpdate -d 168.xxx.xx.x
    offset 0.000006
    29 Dec 15:47:15 ntpdate[10500]: adjust time server 168.xxx.xx.x offset 0.000006 sec

   Compare this with a timeserver which is unavailable:

    ntpdate -d 168.xxx.xx.x
    offset 0.000000
    29 Dec 15:49:13 ntpdate[10699]: no server suitable for synchronization found

   In this example, it is clear that the time server is unavailable. If asktime were run against this timeserver, the node would never be seen to synchronize. In this case, work with the customer to verify that the assigned addresses are correct and that the NTP port (UDP 123) is not blocked by a firewall.
4. Do not rely on a simple ping test to verify the timeserver. Ping could be blocked by a firewall while NTP is allowed, or vice versa. When verifying NTP connectivity, ntpdate is effectively a replacement for ping.
5. All nodes must be able to communicate with the external time servers. This can be verified using the Avamar mapall command (assuming that Avamar is installed and that the probe.xml file is properly configured):

    mapall --all --user=root ntpdate -d 168.xxx.xx.x

6. Review the output and verify that all nodes can communicate with the time server, per the examples above. If all nodes are communicating as necessary, run ntpdate without the "-d" flag to actually update the system time (assuming NTPD is not already running):

    mapall --all --user=root ntpdate 168.xxx.xx.x
    Using /usr/local/avamar/var/probe.xml
    (0.s) ssh -x root@10.x.xxx.xxx 'ntpdate 168.xxx.xx.x'
    29 Dec 17:40:41 ntpdate[23552]: adjust time server 168.xxx.xx.x offset 0.014792 sec
    (0.0) ssh -x root@10.x.xxx.xxx 'ntpdate 168.xxx.xx.x'
    30 Dec 01:40:42 ntpdate[18131]: adjust time server 168.xxx.xx.x offset 0.029407 sec
    (0.1) ssh -x root@10.x.xxx.xxx 'ntpdate 168.xxx.xx.x'
    30 Dec 01:40:43 ntpdate[16250]: adjust time server 168.xxx.xx.x offset 0.000689 sec

Note: Restarting the NTPD service (service ntpd restart) calls ntpdate to perform similar steps. However, when the service is restarted from within asktime, there is no immediate indicator that the ntpdate command has succeeded or failed. Ideally, ntpdate should be run separately to verify that the connectivity is good.

Section #5 - NTP server selection:
The NTP services on a customer site are likely managed by the network team, whereas the backups (Avamar) are more likely managed by a server team. The server team may not know about the NTP servers, any firewall changes required to allow connectivity, change request requirements, and so on.
For this reason, plan ahead of the installation and ensure that the customer knows what is required in advance.
If there are no obvious NTP servers available:
- Many routers are configured to run NTP, so try, for instance, the default gateway IP. If it responds, ensure that it has the correct time.
- Windows Active Directory (AD) servers run NTP by default. However, these can be running on virtual machines, which are often unreliable time sources. If ntpdate reports a good connection but NTPD is unable to establish a good synchronization with the target, work with the customer to find a good local time source.
- If a time source can be located, always confirm with the customer that it is appropriate before configuring Avamar to use it.
- Where possible, locate NTP servers within the local campus to minimize issues caused by slow or high-latency connections.
- Where possible, use multiple NTP servers. The more servers that are available, the better NTP can compare them and establish the most accurate time.

Note: Avamar can work with no time servers. The data nodes synchronize with the Avamar Utility Node; however, the entire grid eventually drifts over time.
Note: check.dpn (and hence the install and GSAN startup) warns if there are fewer than three time servers selected. This is a warning rather than an error, but do attempt to configure multiple servers where possible.

Section #6 - Further troubleshooting:
Occasionally, despite ntpdate initially showing a good timeserver handshake, NTPD consistently fails to use the timeserver as an authoritative time source.
This can be verified by the output of ntpq (details below). Frequently the output of ntpdate can be used to demonstrate this to the customer, for instance:
- Rerunning "ntpdate -d <timeserver>" shows a fluctuating offset value, indicating that the time offered is inconsistent (see the previous comment about NTP running on virtual machines (VMs)) or that high network latency is causing issues.
- Running ntpdate against a couple of different NTP servers consistently shows that they are reporting different times. The difference may only be a matter of seconds, but it can be enough that NTPD rejects both servers as valid candidates.
- Identify good local timeservers.
If this occurs during a build and is impacting the schedule, consider building without any time servers (the system reverts to the utility node as a time source) and reconfigure the time servers later using asktime, once they can be correctly validated.

Section #7 - Using ntpq to inspect current NTPD status:
Note: A full explanation of the ntpq output data is beyond the scope of this article. However, ntpq and the other NTP programs are well documented here: http://doc.ntp.org (external link)
The ntpq utility can be used to view the current NTPD configuration and clock selection once asktime has been run, either manually or as part of the build process. A typical output is as follows:

    ntpq -p
     remote              refid           st t when poll reach   delay  offset  jitter
    ==============================================================================
    +d-host.company.com  168.xxx.xx.xx    3 u   59   64   377  78.917  -6.205   8.690
    +e-host.company.com  168.xxx.xx.xx    3 u   54   64   377  77.521  -4.340   8.744
    +f-host.company.com  168.xxx.xx.xx    3 u   58   64   377  78.063  -1.381  10.317
    +g-host.company.com  168.xxx.xx.xx    3 u   49   64   377  77.723  -6.972   8.570
    *h-host.company.com  128.xx.xx.xx     2 u   49   64   377  77.003  -7.736   8.511
    +i-host.company.com  130.xxx.xxx.xxx  2 u   42   64   377  78.341  -1.701   9.984
     j-host.company.com  .INIT.          16 u    -  256     0   0.000   0.000 4000.00
     LOCAL(0)            LOCAL(0)         8 l   51   64   377   0.000   0.000   0.001

If ntpq is initially slow to respond, this may be because it is trying to resolve the names of time servers against a badly configured Domain Name System (DNS) setup. If so, run ntpq with the -n flag to skip name lookups. However, also try to establish why DNS is not resolving names, and fix it accordingly:

    ntpq -pn
     remote              refid           st t when poll reach   delay  offset  jitter
    ==============================================================================
    +128.xxx.xx.xx       168.xxx.xx.xx    3 u   63   64   377  78.917  -6.205   8.690
    +128.xxx.xx.xx       168.xxx.xx.xx    3 u   58   64   377  77.521  -4.340   8.744
    +10.xxx.xxx.xx       168.xxx.xx.x     3 u   62   64   377  78.063  -1.381  10.317
    +10.xxx.xxx.xx       168.xxx.xx.xx    3 u   53   64   377  77.723  -6.972   8.570
    *168.xxx.xx.x        128.xx.xx.xx     2 u   53   64   377  77.003  -7.736   8.511
    +168.xxx.xx.xx       130.xxx.xxx.xxx  2 u   46   64   377  78.341  -1.701   9.984
     168.xxx.xx.x        .INIT.          16 u    -  256     0   0.000   0.000 4000.00
     127.xxx.x.x         LOCAL(0)         8 l   55   64   377   0.000   0.000   0.001

The key to Avamar working is the selection of a good time server, and this is indicated by the asterisk in the leftmost column. This is h-host.company.com or 168.xxx.xx.x in the examples above.
The previous example is for a utility node.
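The asterisk check described above can be scripted. The following is a minimal sketch, assuming unmodified "ntpq -p" or "ntpq -pn" output in which the tally character occupies the first column of each peer line; the function name selected_peer is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the peer that ntpd has selected as its
# time source (tally character "*") from ntpq -p / ntpq -pn output on stdin.
# Returns non-zero when no peer is currently selected.
selected_peer() {
    awk '
        substr($0, 1, 1) == "*" { print substr($1, 2); found = 1 }
        END { exit (found ? 0 : 1) }
    '
}
```

For example, "ntpq -pn | selected_peer" prints the selected address, and the non-zero exit status when nothing is selected can be used in a wrapper script to flag a node for closer inspection.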
The following example shows a storage node which is also attempting to use the utility node (0.s) and first storage node (0.0) as time sources:

    ntpq -p
     remote              refid           st t when poll reach   delay  offset  jitter
    ==============================================================================
    +d-host.company.com  168.xxx.xx.xx    3 u   36  128   377  78.403  15.627  26.941
    +e-host.company.com  168.xxx.xx.xx    3 u   82  128   377  77.740  10.448  23.707
    +f-host.company.com  168.xxx.xx.x     3 u   89  128   377  77.982 -16.786  18.895
    +g-host.company.com  168.xxx.xx.xx    3 u   40  128   377  78.565   3.230  16.925
    +h-host.company.com  10.xxx.x.xx      2 u   96  128   377  78.082   0.369  17.982
    *i-host.company.com  128.xx.xx.xx     2 u   35  128   377  77.954  16.410  26.429
     j-host.company.com  .INIT.          16 u    -  256     0   0.000   0.000 4000.00
    +utility.company.com 168.xxx.xx.x     3 u   34  128   377   0.226  -1.589  15.290
    +sn1.company.com     168.xxx.xx.xx    3 u   97  128   377   0.214  -6.072  31.263

From this output:
- In this configuration there is no way to predetermine which time server is selected as the authoritative time source (marked with an asterisk in the left column of the ntpq -p output). The main objective is to ensure that all defined time servers are available and correctly selected. The ntpd service is responsible for deciding which time servers are used.
- j-host.company.com is not servicing time requests. This is evident from the absence of a character in the left column, and from the other values still being at their startup defaults; a connection has never been made to start the time adjustment process. This is the kind of issue that testing with ntpdate helps to identify beforehand.
- The utility node (utility.company.com) and storage node 0.0 (sn1.company.com) are also present in the list of valid servers, as described earlier.
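A peer in the state shown for j-host.company.com can be picked out mechanically, because its reach column is 0 (the reach value is an octal register of the last eight polls, so 0 means none of them was answered). The sketch below assumes the standard ten-column ntpq -p layout; the function name unreachable_peers is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list peers whose reach register is 0 in
# ntpq -p / ntpq -pn output read from stdin.
unreachable_peers() {
    awk '
        NF >= 10 && $(NF - 3) == "0" {
            name = $1
            # Drop the tally character when the line starts with one.
            if (substr($0, 1, 1) != " ") name = substr($1, 2)
            print name
        }
    '
}
```

Run as "ntpq -p | unreachable_peers". An empty result means every configured peer has answered at least one recent poll. Shortly after initial configuration, 0.s and 0.0 may legitimately appear in this list until they have stabilized.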
Note: The utility node (0.s) and first storage node (0.0) do not service time requests until they are fully stabilized. These servers may therefore be marked as INIT for some time after the initial configuration, because they do not respond to requests until they themselves are correctly synchronized.

Section #8 - Configuring time on additional nodes:
Node Adds:
Configuring time services on additional storage nodes to be added to a grid is well documented in the Procedure Generator "Capacity Upgrade Procedures" section.
Other nodes (accelerator and so forth):
It is imperative that the time is correctly configured on additional nodes such as accelerators, so that backup timestamps are correctly recorded. This is a manual process:
a. As root, copy /etc/ntp.conf from an existing storage node (other than 0.0) to the accelerator. This gives the node the details of the external timeservers and the key grid timeservers (0.0 and 0.s).
b. As root, edit /etc/ntp.conf on both the utility node (0.s) and the first storage node (0.0). (The ntp.conf file defines runtime parameters for NTP.) The new node IP must be added to the access control list on nodes 0.s and 0.0 to allow them to respond to requests from the accelerator:

    # - - - - -
    # Individual DPN node restrictions - they can listen, but they can't
    # change us, except as above.
    #
    restrict 10.x.xxx.xxx nomodify
    restrict 10.x.xxx.xxx nomodify
    restrict 10.x.xxx.xxx nomodify
    restrict <Accelerator server IP> nomodify

c. Once the accelerator node is added, restart the ntpd service on both the utility node (0.s) and node 0.0 so that the configuration file is re-read:

    mapall --nodes=0.0,0.s --user=root service ntpd restart

d. On the new node, as root, change the service configuration to automatically run ntpd at boot-up:

    chkconfig --level 35 ntpd on

e. Start ntpd on the new node:

    service ntpd start
    ntpd: Synchronizing with time server: [ OK ]
    Starting ntpd: [ OK ]

f. The accelerator node reports in local time. The time zone is controlled by the file /etc/localtime, which is what asktime modifies when setting the time zone. The simplest way to set it is to copy it straight from the utility node:

    scp /etc/localtime root@accelerator:/etc/localtime

g. Use the date command on the new accelerator node to verify that the correct time and time zone are being reported.

Section #9 - Timesync issues during normal operations:
NTP is very reliable unless there is a change in network configuration, time server IPs are modified, and so on. Because Avamar syncs its nodes to the utility and first storage nodes (0.s and 0.0), changes can be made and not actually become a problem for some time. If nodes are misconfigured, they may not be able to synchronize with the utility node or first storage node, and will eventually fall out of sync with the other data nodes.
The grid (GSAN) checks that time synchronization is appropriate at the start of each maintenance activity. If the time differs by more than two seconds between any nodes, the activity fails. A message similar to the following can be seen in the err.log file on one or more nodes:

    2010/12/30-02:23:10.57646 {0.3} [cpman:3411] WARN: <0980> samconn::dpntimecheck retrying dpn time check mytime=1293675790
    2010/12/30-02:23:10.57712 {0.3} [cpman:3411] WARN: <0980> samconn::dpntimecheck retrying dpn time check mytime=1293675790
    2010/12/30-02:23:10.57782 {0.3} [cpman:3411] ERROR: <0001> samconn::dpntimecheck time mismatch: synchronize clocks and retry

Or this from status.dpn:

    Checkpoint failed with result MSG_ERR_BADTIMESYNC : cp.20101229150030 started Wed Dec 29 07:00:30

To resolve this issue, review the ntpq output as described above to determine which nodes are not able to synchronize, and why. Work with the customer to determine whether a recent network change has caused this.
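The two-second rule described above can be approximated outside the GSAN as a rough sanity check. The sketch below is not how the GSAN performs its check; it assumes the caller has already gathered one "<node> <epoch-seconds>" pair per line (for example from "date +%s" on each node, an input format invented for this illustration), and the function name check_skew is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read "<node> <epoch-seconds>" pairs on stdin and
# compare the spread between the earliest and latest clock with a limit
# (first argument, default 2 seconds, mirroring the rule described above).
# Returns 0 when the spread is within the limit, 1 otherwise.
check_skew() {
    awk -v limit="${1:-2}" '
        NF == 2 {
            v = $2 + 0
            if (n++ == 0 || v < min) min = v
            if (n == 1 || v > max) max = v
        }
        END {
            spread = max - min
            printf "spread=%d limit=%d\n", spread, limit
            exit (spread > limit ? 1 : 0)
        }
    '
}
```

Because the samples are collected one node at a time, the collection itself adds some delay, so a small spread reported here is not proof of a problem. The ntpq output remains the authoritative view of each node's synchronization state.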
One common cause of this is asktime having been run incorrectly, with only the new nodes selected for modification. In that case asktime correctly configures those nodes, but does not update the ntp.conf access control lists on 0.s and 0.0 to add the IPs of the new nodes. In addition, it does not restart NTPD on 0.s and 0.0, which is required for them to re-read the ntp.conf file. As a result, the new nodes never synchronize time with the grid. If an external time server is specified, the new nodes sync time with that server (as does the grid), so both the grid and the new nodes appear to have an authoritative time server. However, the additional nodes cannot sync to 0.0 and 0.s, so if the external time server becomes unavailable, they eventually fall out of sync and fail.
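Whether 0.s and 0.0 allow a given node can be checked by looking for its "restrict <IP> nomodify" line in their ntp.conf. The following is a minimal sketch, assuming the access control entries use the "restrict <IP> ..." form shown in Section #8; the function name missing_restricts is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: given ntp.conf text on stdin and one or more node
# IPs as arguments, print every IP that has no active "restrict <ip> ..."
# line. Commented-out lines are ignored. Returns 1 if any IP is missing.
missing_restricts() {
    conf=$(cat)
    rc=0
    for ip in "$@"; do
        if ! printf '%s\n' "$conf" |
            awk -v ip="$ip" '
                $1 == "restrict" && ($2 "") == (ip "") { found = 1 }
                END { exit (found ? 0 : 1) }
            '
        then
            printf '%s\n' "$ip"
            rc=1
        fi
    done
    return "$rc"
}
```

For example, running "missing_restricts <Accelerator server IP> < /etc/ntp.conf" as root on 0.s and on 0.0 prints the IP if its entry is absent. After adding any missing entry, NTPD on that node still has to be restarted so that the file is re-read.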