Currently, we're only tracking the last refresh time and performing reconciliation of non-active connections on every refresh. This causes issues in large clusters when tunnels are added sequentially. Consider the following example:

  1. Tun-1 added -> refresh()
     -> Tun-1: adding 'in' and starting 'out'.

  2. Tun-2 added -> refresh()
     -> Tun-2: adding 'in' and starting 'out'.
     -> Tun-1: The other side didn't have time to initiate the 'in'
               connection yet, so it is not active. But we see that
               it's not active and try to start it.

  3. Tun-3 added -> refresh()
     -> Tun-3: adding 'in' and starting 'out'.
     -> Tun-2: The other side didn't have time to initiate the 'in'
               connection yet, so it is not active. But we see that
               it's not active and try to start it.
     -> Tun-1: The connection still had no time to become active, but
               we declare it 'defunct' and re-create it.

The behavior above is specific to Libreswan 4. Libreswan 5 reports UP connections as active in most cases, so they will not be marked as defunct, but they will still be started again shortly after being added, even though that is not needed.

This creates unnecessary churn in the cluster and puts Libreswan in an uncomfortable position where crossing stream issues (where both sides try to establish the same connection at the same time) are far more likely.

Fix that by tracking the time when we add or start each individual connection, instead of just the last time we refreshed for any reason. This should make ovs-monitor-ipsec actually wait for the reconciliation interval before attempting to repair a connection, giving Libreswan a decent amount of time to process the changes and establish connections normally.

Note: even though we could precisely track 15 seconds for each individual connection and wake up exactly when those 15 seconds expire, we're not doing that in this patch.
The reason is that we still need to wake up every 15 seconds to check that all the previously active connections are still active, and doing that allows for refreshing many connections in the same run instead of waking up every second just for one connection.

Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections periodically.")
Reported-at: https://issues.redhat.com/browse/FDP-1364
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
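The idea of the fix can be illustrated with a minimal Python sketch (ovs-monitor-ipsec itself is a Python script). This is not the code from the patch: `ConnectionTracker`, `mark_started()`, `needs_repair()`, `reconcile()` and `RECONCILIATION_INTERVAL` are hypothetical names, and the backend's view of which connections are active is assumed to arrive as a plain dict. Each connection keeps its own add/start timestamp, and the periodic pass only repairs non-active connections whose own interval has elapsed, so several connections can still be handled in a single wake-up.

```python
import time

# Assumption: the 15-second reconciliation interval mentioned in the commit text.
RECONCILIATION_INTERVAL = 15.0


class ConnectionTracker:
    """Hypothetical helper that remembers when each connection was added or started."""

    def __init__(self, interval=RECONCILIATION_INTERVAL, clock=time.monotonic):
        self.interval = interval
        self.clock = clock        # injectable so the behavior can be replayed
        self.last_started = {}    # connection name -> timestamp of last add/start

    def mark_started(self, name):
        """Record that 'name' was just added or (re)started."""
        self.last_started[name] = self.clock()

    def forget(self, name):
        """Drop tracking for a connection that was removed."""
        self.last_started.pop(name, None)

    def needs_repair(self, name, active):
        """Repair a non-active connection only after its own interval expired."""
        if active:
            return False
        started = self.last_started.get(name)
        if started is None:
            # Assumption: a connection we never started is eligible immediately.
            return True
        return self.clock() - started >= self.interval

    def reconcile(self, states):
        """One batched pass, run on each periodic wake-up.

        'states' maps connection name -> whether the backend reports it active.
        Returns the sorted names that were (re)started during this pass.
        """
        repaired = []
        for name, active in states.items():
            if self.needs_repair(name, active):
                self.mark_started(name)
                repaired.append(name)
        return sorted(repaired)


if __name__ == "__main__":
    now = [0.0]
    tracker = ConnectionTracker(clock=lambda: now[0])
    tracker.mark_started("tun-1")   # t=0: Tun-1 added
    now[0] = 5.0
    tracker.mark_started("tun-2")   # t=5: Tun-2 added
    now[0] = 10.0
    # Neither tunnel is active yet, but neither has waited 15 seconds.
    print(tracker.reconcile({"tun-1": False, "tun-2": False}))  # []
    now[0] = 16.0
    # Only Tun-1 has waited long enough to be restarted.
    print(tracker.reconcile({"tun-1": False, "tun-2": False}))  # ['tun-1']
```

Replaying the first steps of the example above, Tun-1 is no longer restarted when Tun-2 is added; it is only touched once its own 15 seconds have passed. Using a monotonic clock keeps the per-connection ages correct even if the wall-clock time is adjusted.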