Stopping tasks with STOP and then proceeding with SEIZE is actually excessive --
SEIZE alone is enough. Moreover, just killing a task with STOP is also racy,
since the task needs some time to go to sleep before its proc files
can be parsed.
Rewrite all this code to SEIZE the task and all its threads from the very beginning.
With this we can distinguish the stopped task state and migrate it properly (not
supported now, needs to be implemented).
This approach, however, has one BIG problem -- after we SEIZE a task we should seize
its threads, but we should do it in a loop -- reading /proc/pid/task and seizing
them again and again, until the contents of this dir stop changing (not done now).
Besides, after we have seized a task and all its threads we cannot scan its children
list just once -- a task can get reparented to init, and any child of the task can call
clone with the CLONE_PARENT flag, thus repopulating the children list of the already
seized task (also not done).
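The thread-seizing loop described above could be sketched roughly as follows. This is only an illustration of the "re-read until stable" idea, not what the patch does: seize_all_threads(), seize_fn and seize_dry_run() are hypothetical names, and the actual SEIZE request is left to the callback.

```c
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical callback: attach to one thread, return 0 on success. */
typedef int (*seize_fn)(int tid);

/* Placeholder that seizes nothing; a real one would issue the SEIZE request. */
int seize_dry_run(int tid)
{
	(void)tid;
	return 0;
}

/* Is tid already recorded in the seized[] array? */
static int already_seized(const int *seized, int nr, int tid)
{
	for (int i = 0; i < nr; i++)
		if (seized[i] == tid)
			return 1;
	return 0;
}

/*
 * Re-read /proc/<pid>/task until a full pass finds no new thread.
 * Returns the number of threads seized, or -1 on error.
 */
int seize_all_threads(int pid, seize_fn seize, int *seized, int max)
{
	char path[64];
	int nr = 0;

	snprintf(path, sizeof(path), "/proc/%d/task", pid);

	for (;;) {
		struct dirent *de;
		int fresh = 0;
		DIR *dir = opendir(path);

		if (!dir)
			return -1;

		while ((de = readdir(dir)) != NULL) {
			int tid;

			if (de->d_name[0] == '.')
				continue;
			tid = atoi(de->d_name);
			if (already_seized(seized, nr, tid))
				continue;
			if (nr >= max || seize(tid) != 0) {
				closedir(dir);
				return -1;
			}
			seized[nr++] = tid;
			fresh++;
		}
		closedir(dir);

		if (!fresh)
			return nr; /* directory contents stopped changing */
	}
}
```

A dry run against the current process would be seize_all_threads(getpid(), seize_dry_run, tids, 64). The children list would need the same kind of re-scan, which this sketch does not cover.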
This patch is ugly, yes, but splitting it wouldn't make it much easier to review, sorry :(
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
This eliminates
| ipc_ns.c:287:2: error: dereferencing type-punned pointer will break strict-aliasing rules [-Werror=strict-aliasing]
and makes the code simpler.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Dumping is simple. All but secbits can be read from proc; secbits
are obtained from the parasite.
Restoring is a bit tricky -- when you change anything in the kernel's
cred struct, it performs sophisticated checks and can change
more than was requested, so the creds restoration procedure
is carefully commented step-by-step.
Another thing to mention is that creds are restored after everything
else, i.e. right before performing the final threads sync and sigreturns.
This is done to avoid potential problems with insufficient caps for
restoring other stuff (e.g. CAP_DAC_OVERRIDE or zero euid is most
likely required for opening any image file and for writing to the notorious
/proc/sys/kernel/ns_last_pid control, which in turn goes on until the
very last moment).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
It's done in two steps
- On checkpoint we find which in-flight connections (icons) are
present over all sockets and set the peer number to that of the
appropriate listening socket
- On restore we collect listening sockets, and once
we find an in-flight connection we look up the name of the appropriate
listening socket and use it to call connect()
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
This one is actually an internal kernel magic number for pipefs filesystem
and shouldn't be changed.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Existing ones are boring. Let's switch them to geographical coordinates
of various Russian towns in NNNNEEEE form.
4 digits per coordinate give us up to 2km of inaccuracy, which is more
than enough to find a town. We cannot use longitudes beyond 99.99,
i.e. we won't cover the Far East region, but that's OK -- there are more than
enough good candidates even in the European part of the country alone.
Feel free to extend.
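To make the NNNNEEEE form concrete, here is a small sketch that builds such a value from a latitude and longitude given in hundredths of a degree. It assumes the digits are meant to be read directly from a hex literal (one decimal digit per nibble), which the message does not state explicitly; coord_magic() and bcd4() are hypothetical helper names.

```c
#include <stdint.h>

/* Pack a decimal number 0..9999 into 16 bits, one decimal digit per nibble. */
static uint32_t bcd4(unsigned int v)
{
	uint32_t out = 0;

	for (int shift = 0; shift < 16; shift += 4) {
		out |= (uint32_t)(v % 10) << shift;
		v /= 10;
	}
	return out;
}

/*
 * Build an NNNNEEEE value from coordinates in hundredths of a degree,
 * e.g. 55.75N 37.62E is passed as (5575, 3762).
 * Returns 0 if either coordinate does not fit into 4 digits.
 */
uint32_t coord_magic(unsigned int lat_centideg, unsigned int lon_centideg)
{
	if (lat_centideg > 9999 || lon_centideg > 9999)
		return 0;
	return (bcd4(lat_centideg) << 16) | bcd4(lon_centideg);
}
```

With this encoding, Moscow (55.75N, 37.62E) would come out as 0x55753762. A longitude of 100 degrees or more is rejected, matching the 99.99 limit mentioned above.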
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Only two fields are modifiable -- hostname and domainname. So
read them on dump and write on restore.
File format is simple --
u32 magic
u32 length of nodename
u8[] nodename string
u32 length of domainname
u8[] domainname string
For OpenVZ we can write the release at the end, but that comes later.
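A minimal sketch of writing and reading a file in the layout above. The magic value, the function names and the use of host byte order for the u32 fields are assumptions made for the example; none of them come from the patch.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define UTS_MAGIC_EXAMPLE 0x12345678u /* hypothetical magic, not the real one */

/* Write "u32 length" followed by the string bytes (no trailing NUL). */
static int write_str(FILE *f, const char *s)
{
	uint32_t len = (uint32_t)strlen(s);

	if (fwrite(&len, sizeof(len), 1, f) != 1)
		return -1;
	if (len && fwrite(s, 1, len, f) != len)
		return -1;
	return 0;
}

/* Read one length-prefixed string into buf and NUL-terminate it. */
static int read_str(FILE *f, char *buf, size_t size)
{
	uint32_t len;

	if (fread(&len, sizeof(len), 1, f) != 1 || len >= size)
		return -1;
	if (len && fread(buf, 1, len, f) != len)
		return -1;
	buf[len] = '\0';
	return 0;
}

/* Dump side: magic, nodename, domainname. */
int uts_image_write(FILE *f, const char *node, const char *domain)
{
	uint32_t magic = UTS_MAGIC_EXAMPLE;

	if (fwrite(&magic, sizeof(magic), 1, f) != 1)
		return -1;
	return (write_str(f, node) || write_str(f, domain)) ? -1 : 0;
}

/* Restore side: check the magic, then read both strings back. */
int uts_image_read(FILE *f, char *node, size_t nsz, char *domain, size_t dsz)
{
	uint32_t magic;

	if (fread(&magic, sizeof(magic), 1, f) != 1 || magic != UTS_MAGIC_EXAMPLE)
		return -1;
	return (read_str(f, node, nsz) || read_str(f, domain, dsz)) ? -1 : 0;
}
```

On restore the two strings read back would then be handed to sethostname() and setdomainname().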
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Timers are dumped from inside the parasite code; the format is plain -- just
3 pairs of interval/value, one after another.
The restoration occurs in two stages -- first prepare the timer values in the
restorer (and check them for sanity), then set up the timers at the last stage,
right before actually calling sigreturn.
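The two restore stages could be sketched like this. The image entry layout (four u64 fields) and the REAL/VIRTUAL/PROF ordering are assumptions for illustration only, as are the function names.

```c
#include <stdint.h>
#include <string.h>
#include <sys/time.h>

/* Hypothetical on-disk form of one interval/value pair. */
struct itimer_image_entry {
	uint64_t isec, iusec; /* interval */
	uint64_t vsec, vusec; /* current value */
};

/* Stage 1: convert an image entry and reject insane microsecond fields. */
int prepare_itimer(const struct itimer_image_entry *e, struct itimerval *out)
{
	if (e->iusec >= 1000000 || e->vusec >= 1000000)
		return -1;

	memset(out, 0, sizeof(*out));
	out->it_interval.tv_sec = (time_t)e->isec;
	out->it_interval.tv_usec = (suseconds_t)e->iusec;
	out->it_value.tv_sec = (time_t)e->vsec;
	out->it_value.tv_usec = (suseconds_t)e->vusec;
	return 0;
}

/* Stage 2: arm the three timers; meant to run right before sigreturn. */
int restore_itimers(const struct itimerval vals[3])
{
	static const int which[3] = { ITIMER_REAL, ITIMER_VIRTUAL, ITIMER_PROF };

	for (int i = 0; i < 3; i++)
		if (setitimer(which[i], &vals[i], NULL) < 0)
			return -1;
	return 0;
}
```

Splitting it this way keeps all validation in the early stage, so the late stage has nothing left to do but call setitimer().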
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Introduce 3 states we will have to work with:
* alive for tasks sleeping or running
* dead for zombies
* stopped for stopped tasks. We cannot distinguish tasks in this state now,
but with the freezer cgroup this will become possible
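As an illustration of the three states, the sketch below maps the state letter from /proc/pid/stat to them. The enum names and the mapping function are hypothetical; as noted above, the stopped case cannot really be told apart at this point, so this only shows the intended classification.

```c
/* Hypothetical names for the three states, plus an "unknown" fallback. */
enum task_state {
	TASK_ALIVE,
	TASK_DEAD,
	TASK_STOPPED,
	TASK_UNKNOWN,
};

/* Classify a task by the state letter found in /proc/<pid>/stat. */
enum task_state classify_task_state(char c)
{
	switch (c) {
	case 'R': /* running */
	case 'S': /* interruptible sleep */
	case 'D': /* uninterruptible sleep */
		return TASK_ALIVE;
	case 'Z': /* zombie */
		return TASK_DEAD;
	case 'T': /* stopped */
		return TASK_STOPPED;
	default:
		return TASK_UNKNOWN;
	}
}
```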
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Keep the arch-independent task fields in one struct (will be extended) at the
beginning of the image and keep the pads separate.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Some processes can share a single struct file; we can find such files by their "object IDs".
A file descriptor is opened in one process and sent to the others via a unix socket.
The procedure of restoring files consists of four stages.
* Collect data about all file descriptors
At this stage we find the process that will restore a file descriptor and
create a list of processes that should receive this descriptor.
* Create datagram unix sockets
If a file descriptor is to be received, a unix socket is created
in its place.
* Open file descriptors
The process with the lowest pid opens the file and sends the file
descriptor to everyone waiting for it.
* Receive file descriptors.
When designing this algorithm, we wanted to minimize the number
of context switches. The number of context switches is proportional to the
number of processes.
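The send and receive stages rely on passing descriptors over a unix socket with SCM_RIGHTS. A minimal sketch of that mechanism is below; send_fd() and recv_fd() are illustrative names, and the bookkeeping of who sends what to whom is left out.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control buffer big enough for one descriptor, properly aligned. */
union fd_cmsg {
	char buf[CMSG_SPACE(sizeof(int))];
	struct cmsghdr align;
};

/* Send one descriptor over a unix socket; returns 0 on success. */
int send_fd(int sock, int fd)
{
	char dummy = '*'; /* at least one byte of real data must be sent */
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&u, 0, sizeof(u));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new fd or -1 on error. */
int recv_fd(int sock)
{
	char dummy;
	struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
	union fd_cmsg u;
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) <= 0)
		return -1;

	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS ||
	    cmsg->cmsg_len != CMSG_LEN(sizeof(int)))
		return -1;

	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver gets a new descriptor number that refers to the same struct file as the sender's, so it would still have to be moved to the wanted number with dup2().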
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Currently it can only work with stream sockets that have no skbs in their queues
(listening or established -- both work OK).
The cpt part uses the sock_diag engine that was recently merged into Dave's tree to
collect sockets. Then it dumps sockets by checking the filesystem ID of
descriptors that fail to open through /proc/pid/fd (sockets do not allow
such tricks with opens through proc) against SOCKFS_TYPE.
The rst part is more tricky. Listening sockets are just restored, this is simple.
Connected sockets are restored like this:
1. One end establishes a listening anon socket at the desired descriptor;
2. The other end just creates a socket at the desired descriptor;
3. All sockets that are to be connect()-ed call connect(). Unix sockets
do not block connect() till accept() time and thus we continue with...
4. ... all listening sockets call accept() and ... dup2 the new fd into the
accepting end.
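The four steps above can be sketched within a single process as follows. This is an illustration under assumptions, not the patch's code: restore_connected_pair() and move_fd() are hypothetical names, the two descriptor numbers are assumed to differ, and the listening socket gets its name through Linux autobind so that connect() has something to connect to.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Move fd to the wanted descriptor number; fd is consumed either way. */
static int move_fd(int fd, int want)
{
	int ret;

	if (fd < 0)
		return -1;
	if (fd == want)
		return 0;
	ret = dup2(fd, want);
	close(fd);
	return ret < 0 ? -1 : 0;
}

/* Recreate a connected stream pair at the two given descriptor numbers. */
int restore_connected_pair(int accept_fd, int connect_fd)
{
	struct sockaddr_un addr;
	socklen_t len = sizeof(addr);
	int ask;

	/* 1. Listening socket placed at the accepting end's descriptor. */
	if (move_fd(socket(AF_UNIX, SOCK_STREAM, 0), accept_fd) < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	/* Linux autobind: passing only sun_family picks an abstract name. */
	if (bind(accept_fd, (struct sockaddr *)&addr, sizeof(sa_family_t)) < 0 ||
	    listen(accept_fd, 1) < 0 ||
	    getsockname(accept_fd, (struct sockaddr *)&addr, &len) < 0)
		goto err_listen;

	/* 2. The other end just creates a socket at its descriptor. */
	if (move_fd(socket(AF_UNIX, SOCK_STREAM, 0), connect_fd) < 0)
		goto err_listen;

	/* 3. connect() returns before anyone has called accept(). */
	if (connect(connect_fd, (struct sockaddr *)&addr, len) < 0)
		goto err_connect;

	/* 4. accept() and dup2 the new fd over the listening one. */
	ask = accept(accept_fd, NULL, NULL);
	if (move_fd(ask, accept_fd) < 0)
		goto err_connect;
	return 0;

err_connect:
	close(connect_fd);
err_listen:
	close(accept_fd);
	return -1;
}
```

Because step 4 replaces the listening socket with the accepted one, the temporary name disappears along with it, which is consistent with socket names not being preserved.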
There's a problem with this approach -- socket names are not preserved, but
looking into our OpenVZ implementation I think this is OK for existing apps.
What should be done next is:
1. Need to merge the file IDs patches into our tree and have Andrey
support file sharing. This will solve the
sk = socket();
fork();
case. Currently it simply doesn't work :(
2. Need to add support for DGRAM sockets -- I wrote a comment on how to do it
in can_dump_unix_sk()
3. Need to add support for in-flight connections
4. Implement support for UDP sockets (quite simple)
5. Implement support for listening TCP sockets (also not very complex)
6. Implement support for connected TCP sockets (the hard one; Tejun's patches are not
very good for this from my POV)
Cyrill, plz, apply this patch and put the above descriptions onto wiki docs (do we
have the plans page yet?).
Andrey, plz, take care of unix sockets tests in zdtm. Most likely it won't work till
you do the shared files support for sockets.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Since we use pure syscalls there is no
need to keep an intermediate layer for signals.
Moreover, the mask entry is moved to the end of the structure
so we can easily expand it if that is ever needed.
Note that this breaks backward compatibility with older images,
but since we are at the development stage it should be safe.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Andrew Vagin <avagin@parallels.com>
Since we operate with syscalls directly we have
to convert signal structures between image and
kernel formats, without an intermediate glibc layer.
Note this involves changing sa_entry::flags to u64
(since it's a long int value in the kernel).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
It's needed to keep signal handlers on
disk in a predefined format.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
The parasite code dumps all sigactions into sigact.pid.
v2: remove the hard-coded sizeof(sigset_t)
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
It was used at a very early stage, when
no mincore call was implemented. It's not needed
anymore -- so drop it.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Reduce the pages-xxx.img file size significantly (from 2.1M to ~100K for a simple counter test)
by not dumping private file pages that have not yet changed from their file prototype.
If you have problems with it, just let me know and comment out the definition of PAGE_ANON so
it doesn't block your work.
This uses the mincore flag implemented earlier.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
This one isn't used in the restore process, since the mapped file is
stored in the fdinfo part of the images.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
It's pointless. All vmas are stored in the per-pid image file.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
No need to keep it that big. Note that from
this patch on, if we ever decide to use the kernel
elf approach, the image structures are
to be updated in the kernel as well.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>
Also re-make the image to be 2 pages in size,
which should be enough for the basic params we
need to restore tasks.
Signed-off-by: Cyrill Gorcunov <gorcunov@gmail.com>