v2:
- Pass initial counter value to eventfd call
(flags can't be passed here, since they are obtained
with fcntl and must be restored the same way, or
restore will fail)
- Use rst_file_params for flags and owner restore
- Use eventfd.[ch] instead of eventfs.[ch]
- Move show funcs to eventfd.c
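A minimal sketch of the restore order from the first v2 item. It assumes the counter and the F_GETFL flags were saved in the image. restore_eventfd() is a hypothetical helper, not the actual eventfd.c code. The counter goes into the eventfd() call itself, and the flags are applied afterwards with fcntl(F_SETFL).

```c
#include <fcntl.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical restore helper: 'counter' and 'flags' stand for values
 * read back from the image. The counter is passed straight to eventfd(),
 * while the status flags are re-applied afterwards with fcntl(F_SETFL),
 * the same way they were obtained with fcntl(F_GETFL) at dump time.
 */
static int restore_eventfd(unsigned int counter, int flags)
{
	int fd = eventfd(counter, 0);

	if (fd < 0)
		return -1;

	/* F_SETFL ignores the access mode bits and applies O_NONBLOCK etc. */
	if (fcntl(fd, F_SETFL, flags) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```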
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
If a dgram socket's peer is not connected back,
we can try to resolve the peer by name.
For security reasons this happens only if the '-x' option
is passed at both checkpoint and restore time.
In particular, this is needed for programs which
use dgram sockets to send messages to /dev/log.
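A minimal sketch of resolving the peer by name. It assumes the peer's bound path (e.g. /dev/log) was recorded in the image. open_dgram_to() is a hypothetical helper. It connect()s a fresh AF_UNIX datagram socket to that path.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Hypothetical helper: connect a new AF_UNIX datagram socket to the
 * peer's bound path instead of relying on the peer being connected
 * back. Returns the socket or -1.
 */
static int open_dgram_to(const char *peer_name)
{
	struct sockaddr_un addr;
	size_t len = strlen(peer_name);
	int sk;

	if (len >= sizeof(addr.sun_path))
		return -1;

	sk = socket(AF_UNIX, SOCK_DGRAM, 0);
	if (sk < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	memcpy(addr.sun_path, peer_name, len + 1);

	if (connect(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		close(sk);
		return -1;
	}
	return sk;
}
```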
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A completely unlinked file is one whose n_link count is zero.
The only thing we can do with such a file is read its contents
and carry them with us. To dump such files I introduce the
"path remap" technique.
For a regular file a remapping entry is dumped. It says that at the
restore stage, before regfile->path is opened, this path should be
created as a link to some other name and then (after the open) unlinked.
For completely unlinked files the remap path is the path to
a "ghost" file, i.e. a file which is created only at restore
time and removed completely at the end of it.
Partially unlinked files (i.e. those having n_link != 0, but a
path by which we see them in someone's fd is not accessible) should
be handled in another way.
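The restore-side sequence can be sketched as below. open_remapped() and its arguments are hypothetical. remap_path is the path from the remapping entry, which for completely unlinked files is the ghost file created earlier in restore. path is regfile->path.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical restore helper: make the file reachable under its
 * original name just long enough to open it, then drop that name
 * again. The ghost file itself is removed at the end of restore.
 */
static int open_remapped(const char *remap_path, const char *path, int flags)
{
	int fd;

	if (link(remap_path, path) < 0)
		return -1;

	fd = open(path, flags);

	/* Unlinked right after the open, whether it succeeded or not */
	unlink(path);
	return fd;
}
```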
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Only the show part is implemented; stubs are added to the image
(regular files and pipes).
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
I store them on _entry since sids can only be inherited or
set to the current task's pid. Thus the best we can do is restore sids
at fork time, so we save them in the image we use to fork.
Maybe when we submit patches that will give us the ability to set
arbitrary pgid and sid we'll change this, but this is in the
future.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
For regfiles this is done at open() time; for pipes it is done with fcntl. Use
the same fcntl approach for sockets.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This bit is not per-file, but per-fd, so put it on the fdinfo_entry.
Drain these bits from the parasite together with the fds themselves, save
them into the image and restore them with the fcntl F_SETFD cmd where applicable.
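A minimal sketch of why this belongs on fdinfo_entry. FD_CLOEXEC is read with F_GETFD and written with F_SETFD on a descriptor, so two fds that share one file can carry different values. For example, a duplicate made with F_DUPFD starts with the bit cleared. The helper names are hypothetical.

```c
#include <fcntl.h>

/* Dump side: per-descriptor flags (currently only FD_CLOEXEC) */
static int dump_fd_flags(int fd)
{
	return fcntl(fd, F_GETFD);
}

/* Restore side: 'saved_fd_flags' stands for the value kept in the image */
static int restore_fd_flags(int fd, int saved_fd_flags)
{
	return fcntl(fd, F_SETFD, saved_fd_flags);
}
```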
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Just dump their IDs and check they are not shared. For the future.
IO and SEMUNDO are not there, since tasks may have NO such objects
and currently we cannot tell whether they share them or
both just lack them.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Now we store only real fdtable entries in this file, so it's
time to name the field properly and change type to u32.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The mm_xxx bits are per-mm_struct, not per-task_struct in kernel.
Thus, when we support CLONE_VM we'd better have these bits in a
separate image file.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Do not restore it yet -- the logic we're about to apply to
resolve tasks' paths relative to the dumper/restorer is not yet
clear to me, and it had better be hidden inside a couple of
calls (dump_one_reg_file/open_fe_fd). But since we can't
chroot to an fd, we would have to expose the logic outside of
open_fe_fd, which is not desirable ATM.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Why? Because one day we'll support various CLONE_ flags and
for fdtable and fs info we'd like to have separate images (since
these objects are separate in kernel).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The regfile's ID of a VMA is stored in its shmid field. The
file itself is dumped into the regfiles.img image with a 'special'-ly
generated ID (i.e. we just allocate a new unique one).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It was required before we switched to socketpair restore
scheme. Now it's not required, sockets just connect to
the peer they want to.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a big change, yes. Dump unix sockets in the same manner
as all the other files are dumped now. A few notes, however.
1. We explicitly drop names for connected stream sockets. This is
done to avoid name conflicts -- accepted sockets share their
names with the listening parent. This can be done later by binding
a socket to a name, then renaming it to some temporary unique one
and at the very end renaming it back to the original.
2. Interconnected sockets are restored via the socketpair() call. This is
correct, but names are dropped. We need to bind() sockets after this
(yes, this can be done), but for this we need to implement the
renaming trick described above.
3. The FD for socket queues is constantly re-opened to avoid resolving fd
conflicts. Need to use the service fds engine for this later.
4. Some code cleanup is still required, yes (will follow shortly).
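The bind-then-rename trick from note 1 could look roughly like the sketch below. Paths and helper names are hypothetical. It relies on a bound unix socket keeping the address it recorded at bind() time after its filesystem entry is renamed. The freed name can then be bound again by the next socket that shares it.

```c
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>

/*
 * Bind 'sk' to 'name', then move the filesystem entry out of the way
 * to 'tmp_name' so that another socket can bind to the same 'name'.
 */
static int bind_and_stash(int sk, const char *name, const char *tmp_name)
{
	struct sockaddr_un addr;

	if (strlen(name) >= sizeof(addr.sun_path))
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, name);

	if (bind(sk, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		return -1;

	return rename(name, tmp_name);
}

/* At the very end of restore: put one stashed entry back */
static int unstash(const char *tmp_name, const char *name)
{
	return rename(tmp_name, name);
}
```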
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
A pipe buffer has 16 slots. A slot is a page, an offset and a size.
When we use splice and the data is not aligned, splice attaches
a page from the file cache and sets the offset. For this reason we lose
part of the buffer.
If the data size is more than 15 pages, the data will be page-aligned in the image.
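The arithmetic behind the 15-page limit, as a sketch. It assumes 4096-byte pages and the 16 slots mentioned above. Data spliced straight from the page cache occupies one slot per page it touches. An unaligned placement can touch one page more than an aligned one.

```c
#include <stddef.h>

#define PIPE_SLOTS	16	/* slots in a pipe buffer, as stated above */
#define PAGE_SZ		4096UL	/* assumed page size for this sketch */

/* Pages (= pipe slots) touched by 'len' bytes at file offset 'off' */
static size_t pages_spanned(unsigned long off, size_t len)
{
	unsigned long first, last;

	if (len == 0)
		return 0;
	first = off / PAGE_SZ;
	last = (off + len - 1) / PAGE_SZ;
	return last - first + 1;
}

/*
 * Up to 15 pages of data fit into 16 slots at any page offset, because
 * the unaligned placement costs at most one extra page. Beyond that the
 * data has to be page-aligned in the image.
 */
static int needs_alignment(size_t len)
{
	return len > (PIPE_SLOTS - 1) * PAGE_SZ;
}
```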
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Information about pipes' file structs is saved in one global file, and
an fdinfo_entry is saved for each descriptor.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This was required when pages were stored in elf files for
exec. Now we can stop reading it on eof.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Now every inetsk fd dump results in a new entry in the fdinfo.img file. The sockets themselves are
dumped into the inetsk.img global image file. On restore the generic fdinfo redistribution algo
is used, and inet sockets are opened only when required.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
From now on the fdinfo image only contains plain fdinfo_entry-es.
The type == FDINFO_REG files are described by regfiles.img entries
and are matched by the ID in both.
At the dump stage each newly generated ID results in a new entry in
regfiles.img. At the restore stage open_fe_fd should open a regfile by
the fdinfo's ID. Now this is done in a suboptimal way, which needs improving.
Show shows both images separately.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Make fdinfo_entry carry only the minimal info describing a file
descriptor -- the fd value itself, the fd type (regular file, exe
link, cwd, filemap, and later pipes, sockets, inotifies, etc.)
and the ID of the describing file.
The mentioned ID will identify the object of the given type, e.g. for regfiles
this ID is already generated by the file-ids.c code.
The other part of this structure describes a regfile (i.e. a file
opened with the open syscall). I put this new entry at the end of
fdinfo_entry just to make the patching simpler. Soon this entry
will be dumped into its own file.
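A sketch of the layout this is heading towards, with the regfile part already split out as the last paragraph says it soon will be. Field widths, the reg_file_entry name and find_reg_file() are assumptions for illustration. The lookup is a plain linear scan by ID.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical subset of fd types */
enum { FDINFO_REG, FDINFO_EXE, FDINFO_CWD, FDINFO_MAP };

/* Minimal per-descriptor record: which fd, what kind, which object */
struct fdinfo_entry {
	uint32_t fd;	/* descriptor number in the task */
	uint8_t  type;	/* one of the FDINFO_* values */
	uint32_t id;	/* ID of the describing object, e.g. a regfile */
};

/* Per-object record for files opened with open(); one per ID */
struct reg_file_entry {
	uint32_t id;		/* matches fdinfo_entry.id */
	uint32_t flags;
	uint64_t pos;
	uint16_t namelen;	/* path bytes follow this record in the image */
};

/* Match a descriptor to its regfile by ID */
static const struct reg_file_entry *
find_reg_file(const struct reg_file_entry *files, size_t n, uint32_t id)
{
	for (size_t i = 0; i < n; i++)
		if (files[i].id == id)
			return &files[i];
	return NULL;
}
```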
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The namelen is u16, since u8 is not enough to cover PATH_MAX.
The pos is u64, since a file offset is indeed that long.
The id is u32 as per the previous patch.
Fix the printf-s accordingly.
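For the printf fixes, the portable way to print fixed-width types is the PRI* macros from <inttypes.h>. They expand to the right conversion on every ABI, unlike a hard-coded %lx or %lld. format_entry() is a hypothetical helper.

```c
#include <inttypes.h>
#include <stdio.h>

/* Format the three fields with conversions matching their widths */
static int format_entry(char *buf, size_t size,
			uint32_t id, uint64_t pos, uint16_t namelen)
{
	return snprintf(buf, size,
			"id %#" PRIx32 " pos %" PRIu64 " namelen %" PRIu16,
			id, pos, namelen);
}
```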
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a precursor patch. A macro for the max possible fd type will be required,
and it's easier to use an enum in this case.
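A sketch of what the enum buys: a trailing sentinel that sizes per-type tables and bounds-checks types read from an image. Apart from FDINFO_REG, the member names and the table are hypothetical.

```c
/* Hypothetical fd types; only FDINFO_REG is named in these logs */
enum fd_types {
	FDINFO_REG,
	FDINFO_MAP,
	FDINFO_CWD,
	FDINFO_EXE,

	FDINFO_TYPE_MAX,	/* not a type: the number of types, keep last */
};

/* The sentinel sizes the table, so a new type cannot overflow it */
static const char *const fd_type_names[FDINFO_TYPE_MAX] = {
	[FDINFO_REG] = "reg",
	[FDINFO_MAP] = "map",
	[FDINFO_CWD] = "cwd",
	[FDINFO_EXE] = "exe",
};

/* Bounds-check a type value that came from an image */
static const char *fd_type_name(unsigned int type)
{
	return type < FDINFO_TYPE_MAX ? fd_type_names[type] : "unknown";
}
```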
Signed-off-by: Stanislav Kinsbursky <skinsbursky@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
The core image now contains only core per-task stuff.
The new file resurrects Tula magic number removed earlier.
Acked-by: Andrey Vagin <avagin@parallels.com>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
vma_entry contains shmid, and all shared memory regions are dumped into their own files.
The most interesting part is restore.
A mapping is restored by the process with the smallest pid. The mapping
is created before the restorer is executed.
We map the full mapping and restore its contents, then we open a file from
/proc/pid/map_files and store the descriptor in vma_info. The mapping is
unmapped. Now we can map any region of this mapping in the restorer.
We use this trick because a target process may have this mapping in
several places, and the restorer has no function to open proc files.
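A sketch of the map_files trick. It assumes a Linux /proc with map_files; opening its entries typically requires CAP_SYS_ADMIN, or CAP_CHECKPOINT_RESTORE on newer kernels. shmem_get_fd() and map_files_path() are hypothetical helpers, not the restorer code.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Build "/proc/<pid>/map_files/<start>-<end>" for a mapping */
static int map_files_path(char *buf, size_t size, pid_t pid,
			  unsigned long start, unsigned long end)
{
	return snprintf(buf, size, "/proc/%d/map_files/%lx-%lx",
			(int)pid, start, end);
}

/*
 * Create the whole shared mapping, fill it, grab an fd for it via
 * map_files and unmap it. The fd keeps the shared memory alive, so
 * any region of it can be mmap()-ed again later without /proc.
 */
static int shmem_get_fd(size_t size)
{
	char path[64];
	void *addr;
	int fd;

	addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;

	/* ... the mapping contents would be restored into addr here ... */

	map_files_path(path, sizeof(path), getpid(),
		       (unsigned long)addr, (unsigned long)addr + size);
	fd = open(path, O_RDWR);
	munmap(addr, size);
	return fd;
}
```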
v2: fix error handling
xemul: Fixed static-s and args for cr_dump_shmem
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
It will be used to restore shared mappings
v2: clean up
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
This is a cleanup patch. Use file entry type variable for special files
instead of file entry addr variable.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@openvz.org>
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
These are required for inet sockets, but were not added since listen
sockets do not have them.
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Move proc checks for Z-state into seize_task().
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
This patch was designed to be generic and thus usable for all kinds of
sockets. Not sure that this goal has been reached, but at least I tried.
Key ideas:
1) An on-stack structure for collecting socket queues and then passing them to
parasite code.
2) A singly linked list is used for collecting structures representing sockets
of any kind (!) with queues.
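A sketch of the two ideas, with hypothetical type and field names. The collector lives on the dumper's stack. Items of any socket family are chained into one list, and the list is later flattened into a plain array of fds for the parasite.

```c
#include <stddef.h>

/* Hypothetical per-socket record; any socket family can be chained */
struct sk_queue_item {
	int fd;				/* fd in the dumpee */
	int type;			/* SOCK_STREAM, SOCK_DGRAM, ... */
	unsigned int id;		/* socket ID used in the image */
	struct sk_queue_item *next;
};

/* Collector that lives on the dumper's stack for one task */
struct sk_queues {
	unsigned int entries;
	struct sk_queue_item *list;
};

/* Push-front keeps insertion O(1); order does not matter for the dump */
static void sk_queue_add(struct sk_queues *q, struct sk_queue_item *item)
{
	item->next = q->list;
	q->list = item;
	q->entries++;
}

/* Flatten into an array, e.g. for passing to the parasite as args */
static unsigned int sk_queues_to_array(const struct sk_queues *q,
				       int *fds, unsigned int max)
{
	unsigned int n = 0;

	for (const struct sk_queue_item *it = q->list; it && n < max; it = it->next)
		fds[n++] = it->fd;
	return n;
}
```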
Based on xemul@ patches.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
This patch adds socket queue dump functionality. Key ideas:
1) Socket info is passed as a plain array in the parasite args.
2) The new socket option SO_PEEK_OFF is used with MSG_PEEK to read the
queue's packets.
3) A buffer for packets is allocated for each socket separately, with
the size of the socket's send buffer. For stream sockets this means that the
queue will be dumped in chunks of this size.
Note: a loop around sys_recvmsg() is required for DGRAM sockets - sys_recvmsg()
with MSG_PEEK returns only one packet.
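A sketch of the SO_PEEK_OFF walk. It uses recv() instead of a raw sys_recvmsg() for brevity. peek_queue() and its 'max' bound are hypothetical. With a peek offset set, each MSG_PEEK read continues where the previous one stopped. The loop therefore sees every datagram, or every chunk of a stream, while leaving the queue intact.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

#ifndef SO_PEEK_OFF
#define SO_PEEK_OFF 42	/* asm-generic value; differs on a few arches */
#endif

/*
 * Walk the receive queue of 'sk' without consuming it. Returns the
 * number of reads (packets for DGRAM, chunks for STREAM) or -1.
 * 'max' is a safety bound on the number of reads.
 */
static int peek_queue(int sk, char *buf, size_t size, int max)
{
	int off = 0, n = 0;

	if (setsockopt(sk, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off)) < 0)
		return -1;

	while (n < max) {
		ssize_t ret = recv(sk, buf, size, MSG_PEEK | MSG_DONTWAIT);

		if (ret < 0) {
			if (errno == EAGAIN || errno == EWOULDBLOCK)
				break;	/* everything has been peeked */
			n = -1;
			break;
		}
		if (ret == 0)
			break;	/* simplification: stop on an empty read */

		/* A real dumper would write buf[0..ret) to the image here */
		n++;
	}

	/* A negative offset switches peek offsets off again */
	off = -1;
	setsockopt(sk, SOL_SOCKET, SO_PEEK_OFF, &off, sizeof(off));
	return n;
}
```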
Based on xemul@ patches.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
We replace the generic-object-id concept with the sys_kcmp approach,
which implies a few image format changes (and since it's
early days for the project overall, we're allowed to).
In short -- previously every file descriptor had an ID
generated by the kernel and exported via procfs. If the
corresponding file descriptors were the same objects in
kernel memory, the IDs matched bit for bit. This allowed
us to figure out which files were actually identical
and should be restored in a special way.
Once the sys_kcmp system call was merged into the kernel,
we got a new opportunity -- to use this syscall instead.
The syscall basically compares kernel objects and returns
ordered results suitable for sorting objects in userspace.
For us it means that we treat every file descriptor as a combination
of 'genid' and 'subid'. While 'genid' serves for fast comparison
between fds, the 'subid' is a kind of second key, which guarantees
uniqueness of the genid+subid tuple over all file descriptors found
in a process (or group of processes).
To be able to find and dump file descriptors in a single pass we
collect every fd into a global rbtree, where (!) each node might
become the root of a subtree as well.
The main tree carries only distinct genids. If we find a genid which
is already in the tree, we need to check whether it's indeed
a duplicate. For this we use the sys_kcmp syscall, and if we
find that the file descriptors are different, we simply put the new
fd into the subtree.
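glibc has no kcmp() wrapper, so the call goes through syscall(2). Below is a sketch of the comparison step on a genid collision. fd_key and same_file() are hypothetical, and the code assumes equal files always get equal genids.

```c
#define _GNU_SOURCE
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * kcmp(2) on two files: 0 if fd1 in pid1 and fd2 in pid2 are the same
 * struct file, 1 or 2 to order different ones, 3 if they differ but no
 * ordering is available, -1 on error (e.g. ENOSYS without
 * CONFIG_CHECKPOINT_RESTORE).
 */
static int kcmp_files(pid_t pid1, int fd1, pid_t pid2, int fd2)
{
	return (int)syscall(SYS_kcmp, pid1, pid2, KCMP_FILE, fd1, fd2);
}

/* Hypothetical collected fd; genid is the cheap first-level key */
struct fd_key {
	pid_t pid;
	int fd;
	unsigned int genid;
	unsigned int subid;
};

/*
 * Called when 'b' meets node 'a' in the main tree: 1 means a real
 * duplicate, 0 means a different file (it goes into the subtree with
 * a fresh subid), -1 means kcmp failed.
 */
static int same_file(const struct fd_key *a, const struct fd_key *b)
{
	int r;

	if (a->genid != b->genid)
		return 0;	/* assumes equal files always have equal genids */

	r = kcmp_files(a->pid, a->fd, b->pid, b->fd);
	if (r < 0)
		return -1;
	return r == 0;
}
```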
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
This patch introduces the following changes:
1) introduces a new flag VMA_AREA_SYSVIPC to mark the corresponding vma entries.
2) enhances the task /proc/<pid>/maps parsing to obtain the first 5 characters of the
mapped file's name. If the major number of the device the file belongs to is 0 (tmpfs) and its name
starts with "/SYSV", then this mapping is considered SYSV IPC and the
corresponding vma entry status is updated with the VMA_AREA_SYSVIPC flag.
3) omits dumping of mapping pages for SYSV IPC vmas.
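A sketch of the maps-line check described in item 2, parsing the fields with sscanf(). The helper name is hypothetical, and the parser in the patch itself may differ.

```c
#include <stdio.h>
#include <string.h>

/*
 * Check one /proc/<pid>/maps line, laid out as
 * "start-end perms offset major:minor inode [path]".
 * Returns 1 for a SYSV IPC mapping, 0 otherwise, -1 if malformed.
 */
static int maps_line_is_sysvipc(const char *line)
{
	unsigned long start, end, pgoff, ino;
	unsigned int maj, min;
	char perms[5];
	char name[6] = "";	/* first 5 characters of the path, if any */

	if (sscanf(line, "%lx-%lx %4s %lx %x:%x %lu %5s",
		   &start, &end, perms, &pgoff, &maj, &min, &ino, name) < 7)
		return -1;

	return maj == 0 && strncmp(name, "/SYSV", 5) == 0;
}
```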
Signed-off-by: Stanislav Kinsbursky <skinsbursky@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
v2: New "MSG_STEAL" functionality is used
Signed-off-by: Stanislav Kinsbursky <skinsbursky@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
This name for the structure is misleading, because the structure
will also be used for message queue and semaphore set migration.
This patch renames this structure to ipc_desc_entry. It also renames
all related functions and prints to reflect the structure name change.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
This patch removes the separate collect stage and dumps the tunables object
right after collecting it.
Signed-off-by: Stanislav Kinsbursky <skinsbursky@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
This patch adds the ability to checkpoint/restore
the /proc/pid/exe symlink, so if a process we've just
checkpointed was, say, /path/to/exe, then at restore
time we bring this path back.
There is a restriction on the kernel side: if the
existing /proc/pid/exe is already mapped more than
once, the kernel will refuse to change the symlink,
so we need to restore it late, when the mmaps of crtools
itself are already unmapped (i.e. via a late call in
restorer.c).
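The kernel interface for this is prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, fd). Below is a minimal sketch with a hypothetical helper name. It needs CAP_SYS_RESOURCE, and if called too early it runs into the restriction described above.

```c
#include <fcntl.h>
#include <sys/prctl.h>
#include <unistd.h>

/*
 * Point /proc/self/exe at 'path'. Must run late, once the old
 * executable's mappings (crtools' own) are gone, or the kernel
 * refuses the change.
 */
static int restore_exe_link(const char *path)
{
	int fd = open(path, O_RDONLY);
	int ret;

	if (fd < 0)
		return -1;

	ret = prctl(PR_SET_MM, PR_SET_MM_EXE_FILE, (unsigned long)fd, 0, 0);
	close(fd);
	return ret;
}
```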
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Since we need to work with sysctls pretty heavily,
it's better to add a common engine for all handlers.
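A sketch of what such an engine could look like: a table of (path, value) requests walked by one read/write routine. The struct, enum and function names are hypothetical. The base directory is a parameter, normally /proc/sys, so the routine can also be exercised on ordinary files.

```c
#include <stdio.h>

/* One knob: path relative to the base dir, plus where its value lives */
struct sysctl_req {
	const char *path;	/* e.g. "kernel/pid_max" */
	int *arg;
};

enum sysctl_dir { SYSCTL_READ, SYSCTL_WRITE };

/* Apply one direction to every request in the table */
static int sysctl_apply(const char *base, struct sysctl_req *req, int n,
			enum sysctl_dir dir)
{
	char path[256];

	for (int i = 0; i < n; i++) {
		FILE *f;
		int ok;

		if (snprintf(path, sizeof(path), "%s/%s", base, req[i].path) >=
		    (int)sizeof(path))
			return -1;

		f = fopen(path, dir == SYSCTL_READ ? "r" : "w");
		if (!f)
			return -1;

		if (dir == SYSCTL_READ)
			ok = fscanf(f, "%d", req[i].arg) == 1;
		else
			ok = fprintf(f, "%d\n", *req[i].arg) > 0;

		if (fclose(f) != 0)
			ok = 0;
		if (!ok)
			return -1;
	}
	return 0;
}
```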
Based-on-patch-from: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
This commit brings the former "Rewrite task/threads stopping engine"
commit back. Handling it separately is too complex, so it's better to try
to handle it in place.
Note that some tests might fail; this is expected.
---
Stopping tasks with STOP and proceeding with SEIZE is actually excessive --
SEIZE is enough. Moreover, just kill()-ing a task with STOP is also racy,
since the task should be given some time to come to sleep before its proc
can be parsed.
Rewrite all this code to SEIZE the task and all its threads from the very beginning.
With this we can distinguish the stopped task state and migrate it properly (not
supported now, needs to be implemented).
This thing, however, has one BIG problem -- after we have SEIZE-d a task we should
seize its threads, but we should do it in a loop -- reading /proc/pid/task and
seizing them again and again, until the contents of this dir stop changing (not
done now).
Besides, after we have seized a task and all its threads, we cannot scan its children
list just once -- a task can get reparented to init, and any of the task's children can call clone
with the CLONE_PARENT flag, thus repopulating the children list of the already seized
task (not done either).
This patch is ugly, yes, but splitting it doesn't help to review it much, sorry :(
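The thread loop described as missing above could be sketched as follows. The helpers are hypothetical. Threads exiting mid-scan, cleanup on failure and the children-list problem are not handled. The loop re-reads /proc/pid/task and seizes every newcomer, waiting until it has actually stopped. It repeats until a full pass finds nothing new.

```c
#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_THREADS 1024

/* Read the thread ids from /proc/<pid>/task; returns the count or -1 */
static int list_threads(pid_t pid, pid_t *tids, int max)
{
	char path[64];
	struct dirent *de;
	DIR *d;
	int n = 0;

	snprintf(path, sizeof(path), "/proc/%d/task", (int)pid);
	d = opendir(path);
	if (!d)
		return -1;

	while (n < max && (de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue;	/* skip "." and ".." */
		tids[n++] = (pid_t)atoi(de->d_name);
	}

	closedir(d);
	return n;
}

static int already_seized(const pid_t *set, int n, pid_t tid)
{
	for (int i = 0; i < n; i++)
		if (set[i] == tid)
			return 1;
	return 0;
}

/* Returns the number of seized threads or -1 */
static int seize_all_threads(pid_t pid, pid_t *seized, int max)
{
	pid_t tids[MAX_THREADS];
	int nr_seized = 0, status;

	for (;;) {
		int nr = list_threads(pid, tids, MAX_THREADS);
		int added = 0;

		if (nr < 0)
			return -1;

		for (int i = 0; i < nr; i++) {
			if (already_seized(seized, nr_seized, tids[i]))
				continue;
			if (nr_seized == max)
				return -1;
			if (ptrace(PTRACE_SEIZE, tids[i], NULL, NULL) < 0)
				return -1;
			/* Stop it without a signal and wait until it is stopped */
			if (ptrace(PTRACE_INTERRUPT, tids[i], NULL, NULL) < 0 ||
			    waitpid(tids[i], &status, __WALL) < 0)
				return -1;
			seized[nr_seized++] = tids[i];
			added++;
		}

		/* A pass with no newcomers: the task dir stopped changing */
		if (added == 0)
			return nr_seized;
	}
}
```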
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>