All data in a write buffer can be divided on two parts sent but not yet
acknowledged data and unsent data.
Currently the boundary between sent and unsent data is not dumped and
all the data are restored as if they have already been sent.
This methode can provoke long delays in tcp connection, because a kernel
can wait before retransmitting data.
https://bugzilla.openvz.org/show_bug.cgi?id=2808
The TCP stack must know which data have been sent, because
acknowledgment can be received for them. These data must be restored in
repair mode.
The second part of data have never been sent out, so they can be
restored without any tricks. These data can be sent into socket as
usual.
For restoring unsent data the repair mode is disabled for socket,
but it is enabled back after restoring data. It will be disabled
after unlocking network. In this case window probe is sent, which is
required for waknge the connection.
This patch fixes long delays in tcp connections after dumping and
restoring.
Thanks Pavel for the idea of disabling repair mode for restoring
unsent data.
https://bugzilla.openvz.org/show_bug.cgi?id=2808
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
TCP send queue contains two types of data:
* unacknowledged data, which have been sent out
* data, which have not been sent
Currently only the size of all data is save, but it's not enough for
proper restoring of TCP connections.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Currently if a network namespace is dumped and something fails, sockets
remain in repair mode. It's because cpt_unlock_tcp_connections is
executed only if network namespace is not dumped.
cpt_unlock_tcp_connections disables repair mode for sockets and drops
netfilters. netfilters are not used in case of network namespaces.
v2: don't execute network-unlock scripts, if network namespace are not
dumped.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The maximal size which may be used in the kernel for sending TCP data
on restore is varies depending on how many memory installed on the
system, moreover the memory allocated for "read queue" is bigger than
used for "write queue". Thus when we checkpointed a big slab of data
we need to figure out which size is allowed for sending data on restore.
For this we read /proc/sys/net/ipv4/tcp_[wmem|rmem] on restore and calculate
the size needed, then we simply chop data to segements and send it
in a loop.
Typical output on restore is something like
| (00.013001) 30110: TCP queue memory limits are 2097152:3145728
https://bugzilla.openvz.org/show_bug.cgi?id=2751
[xemul: moved stuff to kerndat.c]
Reported-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have generic do_pb_show() call and tons of show_foo
routines, that just call one with proper args. Compact
the code by putting the args into array and calling
the do_pb_show() in one place.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
tcp_repair_off implicitly modifies SO_REUSEADDR option
inside the kernel (thanks avagin@ for pointing this
feature out) thus if we are to rollback and restore
the former settings of socket -- don't forget to
repair this particular one.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's being reported that some systems (as Ubuntu 13.04) already
have struct tcp_repair_opt definition in their system headers.
| sk-tcp.c:25:8: error: redefinition of struct tcp_repair_opt
| sk-tcp.c:31:2: error: redeclaration of enumerator TCP_NO_QUEUE
So add a facility for compile time testing for reported entities
to be present on a system. For this we generate include/config.h
where all tested entries will lay and source code need to include
it only in places where really needed.
Reported-by: Vasily Averin <vvs@parallels.com>
Acked-by: Kir Kolyshkin <kir@openvz.org>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The fd in -> open callback is temporary (the files restoring
engine will re-open one under some other fd). But since we
add this fd to future repair off, this off will fail working
on wrong fd.
Move scheduling for repair-off into port-open where the corrent
fd is known.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
CID 996187 (#1 of 1): Resource leak (RESOURCE_LEAK)
10. leaked_storage: Variable "buf" going out of scope leaks the storage it points to.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If a TCP socket will get live-migrated from one box to another the
timestamps (which are typically ON) will get screwed up -- the new
kernel will generate TS values that has nothing to do with what they
were on dump. The solution is to yet again fix the kernel and put a
"timestamp offset" on a socket.
v2: don't fail if TCP_TIMESTAMP is unsupported
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It's no longer required to use this option -- two currently
supported cases (tasks on host and tasks in containers) can
be detected automatically. Keep this option for future.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
TCP repair mode should be disabled after unlocking connections.
Disabling of repair mode drops SO_REUSEADDR, so it should be restored.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
* The following files goes into the directory arch/x86/include/asm unmodified:
- include/atomic.h,
- include/linkage.h,
- include/memcpy_64.h,
- include/types.h,
- include/bitops.h,
- pie/parasite-head-x86-64.S,
- include/processor-flags.h,
- include/syscall-x86-64.def.
* Changed include directives in the source files that include the headers
listed above.
* Modified build scripts to reflect the source moves.
Signed-off-by: Alexander Kartashov <alekskartashov@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A socket can get fin and a state will be changed on CLOSE_WAIT,
which is not supported yet.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We have a window between getting info about tcp connections
and blocking them.
#2419
v2: clean upV
v3: don't update lengthes of queues for listen sockets,
they don't used.
v4: check that a state of a tcp connection is ESTABLISHED or CLOSE
v5: * don't check state, because it can be changed only on TCP_CLOSE.
In this case it will be changed again after restoring.
* refresh a socket after enabling the repair mode
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Sizes of send and recv buffers are set to maximum to restore queues,
after that sizes of buffers are restored.
O_NONBLOCK is set to a socket to prevent blocking during restore.
#2411
v2: do the same for unix sockets.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
We'll support other tcp states and udp-specific info eventually.
This introduced switch() looks more friendly to this future.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
rcv_wscale is a symetric parameter with snd_wscale.
Both this parameters are set on a connection handshake.
Without this value a remote window size can't be interpreted correctly,
because a value from a packet should be shifted on rcv_wscale.
This patch doesn't break a back compatibility, a rcv window
will be restored with the same bug (rcv_wscale = 0).
v2: Update to a new kernel interface:
[PATCH] tcp: restore rcv_wscale in a repair mode (v2)
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
One function is used on restoring and one is used on dumping,
so each function has own prefix rst or cpt.
The both functions have the same effect, so the main part of the names
is same and it describes "unlock_tcp_connections".
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For another netns we don't need to lock separate connections,
an external chanel can be locked.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
TCP_REPAIR should be droppet when a network is unlocked.
A network should be unlocked at the last moment, because
after this moment restore must not failed, otherwise a state of
a tcp connection can be changed and a state of one side in our image
will be invalid.
v2: use xremalloc instead of mmap and remmap
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Use sys_setsockopt and declare a function in a header as "static inline"
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A disabling repair mode drops SO_REUSEADDR.
We can set SO_REUSEADDR after disabling repair mode, but
a small race window exists in this case.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The pb_read thing is no longer a macros. This will allow to
factor out objects collecting on restore.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It was accidentally broken by 424a4adb6. It's better to use sk ino
instead of sk id, since tcp connection may be unbound from any fds
(not supported now) and thus there may be no ID for those.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Use fdtype_ops facility to c/r inet sockets.
v2:
- Use BUG_ON if socket is attempted to be dumped
several times
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The task is not complete - this is just a part of what have to be done. I.e.
looks like a lot of excessive deps can be fixed.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>