From aa83b956af98efd719ed79f74b208670db124340 Mon Sep 17 00:00:00 2001 From: Marcin Siodelski Date: Tue, 22 May 2018 19:47:40 +0200 Subject: [PATCH 1/3] [5603] Documented clock skew in HA and "terminated" state. --- doc/guide/hooks-ha.xml | 61 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 61 insertions(+) diff --git a/doc/guide/hooks-ha.xml b/doc/guide/hooks-ha.xml index 66ca7621ee..84e65e99b2 100644 --- a/doc/guide/hooks-ha.xml +++ b/doc/guide/hooks-ha.xml @@ -95,6 +95,41 @@ +
+ Clocks on Active Servers + Synchronized clocks are essential for the HA setup to operate + reliably. The servers share lease information via lease updates and + during synchronization of the databases. The lease information includes + the time when the lease has been allocated and when it expires. Some + clock skew between the servers participating in the HA setup would usually + exist. This is acceptable as long as the clock skew is relatively low, + compared to the lease lifetimes. However, if the clock skew becomes too + high, the different notion of time for the lease expiration by different + servers may cause the HA system to malfunction. For example, one server + may consider valid lease to be expired. As a consequence, the lease reclamation + process may remove a name associated with this lease from the DNS, even though + the lease may later get renewed by a client. + + Each active server monitors the clock skew by comparing its current + time with the time returned by its partner in response to the heartbeat + command. This gives a good approximation of the clock skew, although it + doesn't take into account the time between sending the response by the + partner and receiving this response by the server which sent the + heartbeat command. If the clock skew exceeds 30 seconds, a warning log + message is issued. The administrator may correct this problem by + synchronizing the clocks (e.g. using NTP). The servers should notice + the clock skew correction and stop issuing the warning. + + If the clock skew is not corrected and it exceeds 60 seconds, the + HA service on each of the servers is terminated, i.e. the state + machine enters the terminated state. The servers + will continue to respond to the DHCP clients (as in the load-balancing + or hot-standby mode), but will neither exchange lease updates nor + heartbeats and their lease databases will diverge. In this case, the + administrator should synchronize the clocks and restart the servers. +
+
Server States The DHCP server operating within an HA setup runs a state machine @@ -167,6 +202,26 @@ answer from the partner and is not doing anything else while the leases synchronization takes place. + terminated - an active server + transitions to this state when the High Availability hooks library + is unable to further provide reliable service and manual + intervention by the administrator is required to correct the problem. + It is envisaged that various issues with the HA setup may cause the + server to transition to this state in the future. As of the Kea 1.4.0 + release, the only issue causing the HA service to terminate is + unacceptably high clock skew between the active servers, i.e. if the + clocks on the respective servers are more than 60 seconds apart. + While in this state, the server will continue responding to the + DHCP clients based on the HA mode selected (load balancing or + hot standby), but the lease updates won't be exchanged and the + heartbeats won't be sent. The server which got into the + "terminated" state will remain in this state until it is + restarted. The administrator must eliminate the issue which caused + this situation prior to restarting the server (synchronize clocks). + Otherwise, the server will return to the "terminated" state as + soon as it finds that the clock skew is still too high. + + waiting - each started server instance enters this state. The backup server will transition directly from this state to the backup state. 
@@ -245,6 +300,12 @@ disabled none + + terminated + active server + enabled + same as in the load-balancing or hot-standby state + waiting any server From 5143d5d42c602ebca3cd2b53e3881a8dadf4883b Mon Sep 17 00:00:00 2001 From: Thomas Markwalder Date: Tue, 22 May 2018 14:25:05 -0400 Subject: [PATCH 2/3] [5603] Minor ha documentation wording changes --- doc/guide/hooks-ha.xml | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/doc/guide/hooks-ha.xml b/doc/guide/hooks-ha.xml index 84e65e99b2..80bb7a696d 100644 --- a/doc/guide/hooks-ha.xml +++ b/doc/guide/hooks-ha.xml @@ -104,11 +104,11 @@ clock skew between the servers participating in the HA setup would usually exist. This is acceptable as long as the clock skew is relatively low, compared to the lease lifetimes. However, if the clock skew becomes too - high, the different notion of time for the lease expiration by different + high, the different notions of time for the lease expiration by different servers may cause the HA system to malfunction. For example, one server - may consider valid lease to be expired. As a consequence, the lease reclamation - process may remove a name associated with this lease from the DNS, even though - the lease may later get renewed by a client. + may consider a valid lease to be expired. As a consequence, the lease + reclamation process may remove a name associated with this lease from + the DNS, even though the lease may later get renewed by a client. Each active server monitors the clock skew by comparing its current time with the time returned by its partner in response to the heartbeat @@ -214,10 +214,10 @@ While in this state, the server will continue responding to the DHCP clients based on the HA mode selected (load balancing or hot standby), but the lease updates won't be exchanged and the - heartbeats won't be sent. The server which got into the - "terminated" state will remain in this state until it is - restarted. 
The administrator must eliminate the issue which caused - this situation prior to restarting the server (synchronize clocks). + heartbeats won't be sent. Once a server has entered the + "terminated" state, it will remain in this state until it is + restarted. The administrator must correct the issue which caused + this situation prior to restarting the server (e.g. synchronize clocks). Otherwise, the server will return to the "terminated" state as soon as it finds that the clock skew is still too high. From c611a2ad1b8aaf8d27be2e7ba1e7959f9d629602 Mon Sep 17 00:00:00 2001 From: Marcin Siodelski Date: Thu, 24 May 2018 16:08:04 +0200 Subject: [PATCH 3/3] [5603] Added a note about restarting the HA service. The server needs to be restarted or reloaded. In the future we may provide a command. --- doc/guide/hooks-ha.xml | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/doc/guide/hooks-ha.xml b/doc/guide/hooks-ha.xml index 80bb7a696d..0b0a52df74 100644 --- a/doc/guide/hooks-ha.xml +++ b/doc/guide/hooks-ha.xml @@ -234,9 +234,16 @@ synchronize first. The secondary or standby server will remain in the waiting state until the primary synchronizes the database. - + - Whether the server responds to the DHCP queries and which + + Currently, restarting the HA service while it is in the + terminated state requires restarting the + DHCP server or reloading its configuration. In the future, we will + provide a command to restart the HA service. + + + Whether the server responds to the DHCP queries and which queries it responds to is a matter of the server's state, if no administrative action is performed to configure the server otherwise. The following table provides the default behavior for