mirror of https://github.com/openvswitch/ovs synced 2025-08-22 01:51:26 +00:00
Ilya Maximets 6fc5221742 ipsec: libreswan: Fix premature reconciliation of just added tunnels.
Currently we're only tracking the last refresh time and performing
reconciliation of non-active connections on every refresh.  This
causes issues in large clusters when tunnels are added sequentially.
Consider the following example:

 1. Tun-1 added -> refresh()
    -> Tun-1: adding 'in' and starting 'out'.

 2. Tun-2 added -> refresh()
    -> Tun-2: adding 'in' and starting 'out'.
    -> Tun-1: The other side hasn't had time to initiate the 'in'
              connection yet, so it is not active.  But we see that
              it's not active and try to start it.

 3. Tun-3 added -> refresh()
    -> Tun-3: adding 'in' and starting 'out'.
    -> Tun-2: The other side hasn't had time to initiate the 'in'
              connection yet, so it is not active.  But we see that
              it's not active and try to start it.
    -> Tun-1: The connection still hasn't had time to become active,
              but we declare it 'defunct' and re-create it.
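
For illustration, the sequence above can be reproduced with a toy model
of the old scheme.  This is a minimal sketch, not ovs-monitor-ipsec
code: OldModel, its methods and the 'repair' action are hypothetical,
connection state is reduced to a single "is up" flag, and no tunnel
ever becomes active because the peers are assumed to be slower than
the additions.

```python
# Hypothetical toy model of the old scheme; not ovs-monitor-ipsec code.
from dataclasses import dataclass, field


@dataclass
class OldModel:
    """Every refresh() tries to repair every non-active connection,
    however recently that connection was added."""
    active: dict = field(default_factory=dict)   # tunnel name -> is up?
    actions: list = field(default_factory=list)  # (time, tunnel, action)

    def add_tunnel(self, name: str, now: float) -> None:
        self.active[name] = False  # 'in' added, 'out' started, not up yet
        self.actions.append((now, name, "add"))
        self.refresh(now, just_added=name)

    def refresh(self, now: float, just_added: str = "") -> None:
        for name, is_up in self.active.items():
            if name != just_added and not is_up:
                # The peer may simply not have initiated 'in' yet, but
                # this scheme cannot tell and repairs the tunnel anyway.
                self.actions.append((now, name, "repair"))


model = OldModel()
for t, name in enumerate(["tun-1", "tun-2", "tun-3"]):
    model.add_tunnel(name, now=float(t))

repairs = [(t, n) for t, n, a in model.actions if a == "repair"]
print(repairs)  # [(1.0, 'tun-1'), (2.0, 'tun-1'), (2.0, 'tun-2')]
```

Tun-1 is already being repaired one second after it was added, and
again one second later, matching steps 2 and 3 above.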

The behavior above is specific to Libreswan 4.  Libreswan 5 reports
UP connections as active in most cases, so they will not be marked
as defunct, but they will still be started unnecessarily soon after
being added.

This creates unnecessary churn in the cluster and puts Libreswan into
an uncomfortable position where crossing-stream issues (where both
sides are trying to establish the same connection at the same time)
are far more likely.

Fix that by specifically tracking time when we add or start each
connection instead of just the last time we refreshed for any reason.
This should make ovs-monitor-ipsec actually wait for the
reconciliation interval before attempting to repair connections,
giving Libreswan a decent amount of time to process the changes and
try to establish connections normally.
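
A minimal sketch of the per-connection tracking, under toy assumptions:
PerConnectionModel, last_action and RECONCILIATION_INTERVAL are
hypothetical names, and the real daemon keeps more state per
connection.  Adding or starting a connection records a timestamp for
that connection only, and a non-active connection is repaired only
once its own timestamp is older than the interval.

```python
# Hypothetical model of per-connection add/start tracking.
RECONCILIATION_INTERVAL = 15.0  # seconds


class PerConnectionModel:
    def __init__(self) -> None:
        self.active: dict[str, bool] = {}        # tunnel name -> is up?
        self.last_action: dict[str, float] = {}  # name -> last add/start
        self.actions: list[tuple[float, str, str]] = []

    def add_tunnel(self, name: str, now: float) -> None:
        self.active[name] = False
        self.last_action[name] = now  # adding counts as an action
        self.actions.append((now, name, "add"))
        self.refresh(now)

    def refresh(self, now: float) -> None:
        for name, is_up in self.active.items():
            if is_up:
                continue
            if now - self.last_action[name] < RECONCILIATION_INTERVAL:
                continue  # still give Libreswan time to bring it up
            self.last_action[name] = now  # a repair is a new start
            self.actions.append((now, name, "repair"))


model = PerConnectionModel()
for t, name in enumerate(["tun-1", "tun-2", "tun-3"]):
    model.add_tunnel(name, now=float(t))  # no premature repairs here
model.active["tun-1"] = True  # tun-1 came up on its own
model.refresh(now=20.0)       # tun-2 and tun-3 are now overdue
model.refresh(now=21.0)       # nothing due: both were just restarted
repairs = [(t, n) for t, n, a in model.actions if a == "repair"]
print(repairs)  # [(20.0, 'tun-2'), (20.0, 'tun-3')]
```

Treating a repair as a fresh start is what prevents the same connection
from being restarted on every subsequent refresh.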

Note: even though we could precisely track 15 seconds for each
individual connection and wake up when exactly 15 seconds expire,
we're not doing that in this patch.  The reason is that we still
need to wake up every 15 seconds to check that all the previously
active connections are still active, and doing that allows for
refreshing many connections in the same run instead of waking up
every second just for one connection.
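
The trade-off in the note can be sketched by counting wake-ups.  Both
helpers below are hypothetical and not taken from ovs-monitor-ipsec:
per_connection_wakeups() models waking exactly when each connection's
interval expires, while periodic_wakeups() models the fixed 15-second
timer (which the daemon needs anyway for checking active connections)
repairing every overdue connection in one run.

```python
# Hypothetical comparison of two wake-up strategies.
RECONCILIATION_INTERVAL = 15  # seconds


def per_connection_wakeups(added_at):
    """Wake up exactly when each connection's own interval expires."""
    return sorted({t + RECONCILIATION_INTERVAL for t in added_at})


def periodic_wakeups(added_at, horizon):
    """Wake up every RECONCILIATION_INTERVAL seconds and repair, in one
    run, every connection whose interval has already expired.  Returns
    a dict mapping wake-up time -> number of connections repaired."""
    pending = sorted(added_at)
    repaired = {}
    t = RECONCILIATION_INTERVAL
    while pending and t <= horizon:
        due = [a for a in pending if t - a >= RECONCILIATION_INTERVAL]
        if due:
            repaired[t] = len(due)
            pending = [a for a in pending if t - a < RECONCILIATION_INTERVAL]
        t += RECONCILIATION_INTERVAL
    return repaired


# Ten connections added one second apart; none of them comes up.
print(len(per_connection_wakeups(range(10))))   # 10
print(periodic_wakeups(range(10), horizon=60))  # {15: 1, 30: 9}
```

In this toy scenario the fixed timer needs two wake-ups instead of ten,
at the cost of some connections waiting up to almost two intervals
before being repaired.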

Fixes: 25a301822e0d ("ipsec: libreswan: Reconcile missing connections periodically.")
Reported-at: https://issues.redhat.com/browse/FDP-1364
Acked-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
2025-05-16 23:29:52 +02:00