mirror of https://github.com/checkpoint-restore/criu synced 2025-08-30 05:48:05 +00:00

11124 Commits

Author SHA1 Message Date
Pavel Tikhomirov
f2bf6597ca path: simplify mnt_get_sibling_path via get_relative_path
Previous code did:

1) get rpath: mount's mountpoint relative to its parent mountpoint
2) get cut_root: parent's root relative to parent's slave root or vice
versa (will be "-" if parent's root is wider or "+" if thicker)
3) return parent's slave mountpoint +/- cut_root + rpath

It can be done more robustly with get_relative_path:

1) get rpath: mount's mountpoint relative to its parent mountpoint
2) get fsrpath: add rpath to parent's root (path relative to fs root)
3) get rpath: fsrpath relative to parent's slave root
4) return parent's slave mountpoint + rpath

In the latter approach we do not need to open-code workarounds for
consecutive slashes in paths (get_relative_path does this for us),
and we also do not need complex logic with +/-.

While on it let's also switch ->mountpoint to ->ns_mountpoint where
possible, as mountpoint can have unexpected prefixes.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/0fd09f8571

Changes: rework mnt_get_sibling_path more.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
abbc70adc9 mount: use ns_mountpoint for children-overmount check
We need to skip the root_yard_mp parent as it has no ns_mountpoint; it
also has no children overmounts, so we are safe. All others can be
compared by ns_mountpoint.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e5665c976

Changes: add mi->parent pre-check, reword commit message.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
c17695cb11 mount: use ns_mountpoint in root_path_from_parent
Fail root_path_from_parent if parent is root_yard, we want to only
lookup root path in real parent mounts.

Now it is safe to use ns_mountpoint instead of mountpoint as both
children and parent have it and they are relative.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e58a91883

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
010295b8fa mount: use ns_mountpoint in validate_children_collision
Function validate_children_collision is both called on dump and on
restore. On dump mountpoint and ns_mountpoint are the same. On restore
as we never call validate_children_collision on helper mounts
(root_yard_mp and cr_time are not in mntinfo list), for all other mounts
strcmp results would be the same with mountpoint and ns_mountpoint.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/8f4fda5ac

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
07eb01593d mount: skip root yard children from mnt_needs_remap check
There is no point in remapping ns root mounts: they can't overmount anybody.

This also allows us to switch mnt_needs_remap from ->mountpoint to
->ns_mountpoint for mount comparison in overmount detection.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/9475bf843

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
e8de10a4fb mount: use ns_mountpoint in mnt_is_overmounted
Let's use ->ns_mountpoint in comparison as ->mountpoint can change (e.g.
see how we add ns root in get_mp_mountpoint and in do_remap_mount we can
change it again). We plan to get rid of ->mountpoint everywhere where we
can use unchanged ->ns_mountpoint.

Cherry-picked hunks from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e98e1456d

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
b954e51360 autofs: use ns_mountpoint in autofs_create_dentries
Replace ->mountpoint with ->ns_mountpoint for determining relations
between mounts.

Also let's use get_relative_path in autofs_create_dentries as it is more
robust. Before that we missed the case where the mountpoint of a child
of an autofs mount is a multilevel subdirectory of the parent
mountpoint, and always created it as a single-level subdirectory.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/5d5462202

Changes: skip children overmount as it does not need a subdirectory.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
7a67949e55 mount: make general place for shared variables on mount-info on restore
Put remounted_rw to it. This allows us to easily add some more of such
variables without allocating each one of them separately.

Due to the existence of shfree_last, a shmalloc'ed region can be
inherited from the previous caller, so it needs to be explicitly
zero-initialized.

Fixes: 0a2d380e6 ("ghost/mount: allocate remounted_rw in shmem to get
info from other processes")

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/6750e5793

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
0c41c1187b mount: fix broken remounted_rw check
Expression (x && REMOUNTED_RW) is always the same as just (x).

It should've been (x & REMOUNTED_RW) to check if the mount is marked as
temporarily remounted writable and needs to be switched back.

By fixing this check we eliminate excess readonly remounts.
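
The difference between the two expressions can be shown with a minimal
sketch. The flag values and function names below are assumptions made
for illustration; they are not taken from the criu sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag values, for illustration only. */
#define REMOUNTED_RW 1
#define REMOUNTED_RW_SERVICE 2

/* Broken form: logical AND only asks "are both operands non-zero?". */
bool marked_rw_broken(int flags, int mask)
{
	return flags && mask;
}

/* Fixed form: bitwise AND tests whether the mask bit is actually set. */
bool marked_rw_fixed(int flags, int mask)
{
	return (flags & mask) != 0;
}

/* Self-check: a value with only the other bit set fools the broken form. */
void remounted_rw_demo(void)
{
	int flags = REMOUNTED_RW_SERVICE; /* REMOUNTED_RW bit is clear */

	assert(marked_rw_broken(flags, REMOUNTED_RW));  /* false positive */
	assert(!marked_rw_fixed(flags, REMOUNTED_RW));  /* correct answer */
	assert(marked_rw_fixed(flags | REMOUNTED_RW, REMOUNTED_RW));
}
```

With the logical form, any non-zero value counts as "remounted rw",
which is where the excess readonly remounts come from.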

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/167f8ac67

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
7182470454 mount: move root yard tree merge as early as possible
Let's merge mount trees under root_yard just after reading from image.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/8e8ecdfdc

Changes: split only root yard part as a separate patch, and put root
yard alloc into merge_mount_trees.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
770cdbfb9f mount: prepare is_overmounted as early as possible
Function mnt_is_overmounted is designed to detect if mount is overmounted in
current tree using comparison of mountpoints of neighbour mounts for detection.
We want to get actual overmounts in dumped tree, we don't expect that helper
mounts we add or merging will introduce new overmounts. So let's do overmount
detection earlier before adding helpers.

Set is_overmounted = false for root yard and binfmt helper mounts.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e98e1456d

Changes: rename set_is_overmounted to prepare_is_overmounted, move it
just after collecting mounts from images to mount tree, handle helper
mounts.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
83bbf1b051 mount: add helper mnt_get_external_bind_nodev
Will use it to find shared mount we can bind from and also can inherit
external slavery. Device-external can't give us external slavery.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/dcd952c4c

Changes: switch to mnt_bind_pick helper.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
0fd0e03a21 mount: do not override master_id to -1 for root binds
There is no point in losing this information; having -1 everywhere in
mount images instead of the actual master id can be confusing.

Note that need_master is now true for bindmounts of root mounts with the
same master_id as the root mount, so they are now handled by common
code; we've added the can_receive_master_from_root check specifically to
handle this case right. Also note that in propagate_mount we no longer
set ->bind for this case; this is handled by mnt_ext_slave list related
code.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/b3c9dc05e

Changes: keep only the master_id related part of the original patch, add
preparatory patches before this one.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
4f156f32ba mount: put external slavery mounts to separate mnt_ext_slave list
We need to put mounts which need to inherit master_id from external
mounts or from root mount into separate list, so that we can set ->bind
on them right in propagate_siblings.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/ea592cf6e

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
ef79912c1d mount: add can_receive_master_from_root helper
If mount has external master_id it can inherit it as a bind of external
mount, but also it can inherit it as a bind of container root mount, so
let's add similar condition to allow such mounts.

Note: need_master is still false for binds of root mount which can
inherit master_id from root mounts; this will change in the next patch.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
b52fcb284a mount: replace CRTIME_MNT_ID with HELPER_MNT_ID
Root yard mount also has mnt_id == 0 so it will look better with a new
name. Let's explicitly initialize root yard mnt_id to HELPER_MNT_ID
for the sake of code readability.

Also in near future we might want to create additional mount helpers to support
mounts in CT with no fsroot mounted.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/45bf6f0ee

Changes: split umount hunk to previous patch, set HELPER_MNT_ID for root
yard.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
4736a7240e mount/restore: leave ns_mountpoint NULL for aux binfmt_misc mount
On dump, yes, mountpoint and ns_mountpoint are the same, but on restore
they are not, and putting something like "<root_yard>/binfmt_misc" into
ns_mountpoint is wrong. Let's leave ns_mountpoint NULL; this mount
should not be compared by ns_mountpoint with other mounts anyway.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
16085b5e67 mount/restore: create auxiliary binfmt_misc mount in the root yard
Put our auxiliary binfmt_misc mount in "<root_yard>/binfmt_misc" instead
of "<root_yard>/<mntns>/proc/sys/fs/binfmt_misc". Thus we can restore
binfmt_misc without altering the actual mount tree, which looks much
safer.

For that we need to remove "fake top mount_info" handling from
add_cr_time_mount as now we intentionally add binfmt_misc mount as a
child of ("fake") root yard. On dump this does not change anything.
Also we need to create mountpoint for binfmt_misc in root yard.

As now mount is out of restored mount tree we don't need to umount it,
so remove corresponding CRTIME_MNT_ID umount hunk in do_new_mount.

Note: to make binfmt_misc c/r work criu should be compiled with
CONFIG_BINFMT_MISC_VIRTUALIZED and binfmt_misc should be actually
virtualized and this is only done in Virtuozzo kernel per ve.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/2eb535843
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/d79c7f441
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/34002bef4
Cherry-picked one hunk from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/45bf6f0ee

Changes: merge all fixups together to one consistent patch.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
a379d4d945 zdtm: add mntns_pivot_root_ro test
This checks that superblock readonly flag is applied to nested mntns
roots on restore.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
2a3d2bc284 mount: apply superblock flags to nested ns roots
Before this change we didn't apply sb-flags if we mount the root mount of
non-root mntns. There is no point in it, if we got to do_new_mount this root
mount is not external bind, so we won't change sb-flags on host if we change it
for this mount. So we just lose sb-flags on some regular container mount for
no reason. Fix it.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e7ffe4c60

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
77f67973f2 zdtm: add mntns_pivot_root test
This creates nested mntns and does pivot_root to tmpfs mount, so that
roots of original test mntns and in nested mntns are different.

Before allowing nested mntnses with different roots in previous patch
this would fail.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
2fdb4993a0 mount: allow nested mount namespaces with different roots
Only the root in root-mntns is special (see rst_mnt_is_root); all other
mounts are mounted regularly, and there is no difference between an ns
root and any other mount or bind-mount.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/f41e41dd5

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
cf6fe2d48b mount: add mnt_is_root_bind helper
Helper mnt_is_root_bind indicates that a mount can be bind-mounted from
the root mount (which in its turn comes from opts.root).

Use it in validate_mounts: we should skip unsupported mount from fsroot check
if we know it will be bindmounted from root mount, is_ns_root check was wrong.

Also fix root mount check in dump_one_fs, root mounts in non root mntns should
be dumped normally if they are not bind-mounts of root mount.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/25d078971

Changes: switch to mnt_bind_pick helper, export to mount.h, also add
mnt_get_root_bind helper for future use in mount-v2, remove excess root
yard hunk.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
e50abbd3b3 zdtm: add mnt_ext_collision test
This test creates two mount namespaces, one "root" with external mount
at /mnt_ext_collision.test/dst and one "nested" with different internal
mount at /mnt_ext_collision.test/dst instead.

This case is important for nested containers, if we dump a container
with some external mount in /mnt we should not also replace mounts in
/mnt for nested containers with the external one. (One example is docker
containers inside Virtuozzo containers.)

Without previous patch which restricts external mounts resolution to
only root mntns of container this test fails as internal mount is
replaced by external one after migration.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
a963ceb770 mount: restrict mp-external mount map to init container mntns only
We resolve mountpoint-external mounts on dump by mountpoint comparison,
so if we have other mount (other superblock e.g. in nested mntns) with
same mountpoint we would also resolve this mount as external and restore
it as external: replacing it completely with different mount... That's
wrong, so to make this interface more robust let's only resolve
mountpoint-external mounts in root mntns of container, not in all
mntnses as it was before.

Note: if actual external mount (bind of external) gets to nested mntns
it's ok not to resolve it as external as criu would bind it from the
resolved mount in root mntns. So external mounts in nested mntns are
still supported after this patch.

Cherry-picked one hunk from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/034498b28

Changes: apply mntns check only to mountpoint-external mounts.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
007501f985 zdtm: add new mnt_ext_root test
This test simply creates a) root external mount and b) "deeper"
bindmount for it (deeper in terms of mnt_depth). Our mount restore code
tries to mount (b) first and fails (without previous patch ordering
external mounts before their binds).

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/d31954669

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
4f94149346 mount: mount external mount before mounting it's binds
The problem is that when we don't order these mounts, we can end up
mounting a non-external bind first via do_new_mount and fail c/r. For
instance, for tmpfs we would fail because there is no image to get
contents from. See the test mnt_ext_root for more info.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/baf3f8db8

Changes: switch to mnt_bind_pick helper, export to mount.h, make check
in can_mount_now skip mounts with ->bind set.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
d5cb7764e4 mount: show more info about why we can't mount
Currently if we have a mount deadlock, it is hard to understand which
mounts lock each other; this makes it easier.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/8ba8499e2
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/48d044ae11

Changes: merge newline fixup.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
685a53eeca mount: rework skipping external mounts in dump_one_mountpoint
Function dump_one_fs already has mnt_is_external_bind check inside, so
there is no point to check pm->external one more time.

Function check_bindmount is intended to check devpts bindmount's master
was opened in right mount namespace, but if bindmount is external mount
there is no point to check this. Let's also skip check for bindmounts of
external mounts.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
3b2b808128 mount: split mnt_is_external(_bind) and can_receive_master_from_external
We use mnt_is_external():

1) In validate_mounts() to skip fsroot existence check for mounts which
will be bind-mounted from external mounts.

2) In resolve_shared_mounts() to skip error on slave mounts without
master mount, if they can receive these master_id through external
mount.

3) In dump_one_fs to skip dump of mounts which will be bind-mounted from
external mounts.

Cases (1) and (3) are the same, but case (2) is quite different. Let's
split these cases, thus making things simpler.

Effectively this patch does not change criu's behaviour at all. While
I can't say that the old mnt_is_external was wrong, it was too complex
and hard to understand, so it's worth switching to a lookup across the
bindmounts list via the general mnt_bind_pick() helper. And now that it
is obvious that mnt_is_external looks for an external bindmount, let's
also change its name to mnt_is_external_bind.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/494b52ba8

Changes: use mnt_bind_pick helper, use is_sub_path helper to be more
robust, rename mnt_is_external to mnt_is_external_bind, fix
clang-format, export to mount.h, use mnt_is_nodev_external as we can not
inherit master from device-external mounts.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
c09bd89419 mount: add mnt_bind_pick helper to pick the desired bind
Adding different pick functions we would be able to search different
things like mounted bind with wider root, or external bind, or external
bind with same sharing group and so on and so forth.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
9d1f39f28a unittest: add some tests for get_relative_path helper
v2: let's also mock kerndat_s.sysctl_nr_open field, to make aarch64
clang ci happy

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
97bd9511ca util: add get_relative_path helper
This is a smart way of getting relative paths:

1) Always returns relative path, no unexpected starting '/';
2) Detects subpath even if path formats are different, only real directory
and file names matter;
3) No path modification/allocation, returns shifted pointer to the
original path.

We have many places where we need to cut a subpath from a path. Different
code blocks doing this job are spread widely across the codebase; for
instance, see cut_root_for_bind and root_path_from_parent. But those
implementations rely on the fact that the subpath's and path's formats
are the same.

When we modify or concatenate paths we can accidentally get strange
path formats, paths given by user can have strange format, and the job
to manually maintain all paths in "simple" format everywhere is too
hard. So let's just add a tool to compare "strange" paths.

E.g.:

get_relative_path("./a////.///./b//././c", "///./a/b") == "c"

Note: ".." in path is not supported, and we just can't support it right
without full filesystem tree information to resolve paths like
"../../a", so we just treat ".." as a directory name which should work
in simple cases.
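
The three properties above can be illustrated with a small sketch. This
is not the criu implementation: rel_path_sketch and its two helpers are
hypothetical names, and the component-by-component comparison is just
one assumed way to get the described behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Skip any run of '/' separators and "." components starting at p. */
static const char *skip_noise(const char *p)
{
	for (;;) {
		if (*p == '/')
			p++;
		else if (p[0] == '.' && (p[1] == '/' || p[1] == '\0'))
			p++;
		else
			return p;
	}
}

/* Length of the path component starting at p (up to '/' or the end). */
static size_t comp_len(const char *p)
{
	return strcspn(p, "/");
}

/*
 * Hypothetical sketch: return a pointer into path to the part that is
 * relative to sub, or NULL if sub is not a prefix of path by components.
 * Nothing is allocated or modified; ".." is treated as a plain name.
 */
const char *rel_path_sketch(const char *path, const char *sub)
{
	for (;;) {
		size_t n;

		path = skip_noise(path);
		sub = skip_noise(sub);
		if (*sub == '\0')
			return path; /* sub fully consumed: the rest is the answer */

		n = comp_len(sub);
		if (comp_len(path) != n || strncmp(path, sub, n) != 0)
			return NULL; /* components differ, or path ended first */
		path += n;
		sub += n;
	}
}

/* Self-check using the example from the text above. */
void rel_path_demo(void)
{
	const char *r = rel_path_sketch("./a////.///./b//././c", "///./a/b");

	assert(r != NULL && strcmp(r, "c") == 0);
	assert(rel_path_sketch("/a/bc", "/a/b") == NULL); /* "bc" is not "b" */
}
```

Because the result is a shifted pointer into the original string, the
caller must not free it separately from the path it was derived from.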

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/73a771348

Changes: add other useful robust path comparison helpers is_sub_path and
is_same_path based on get_relative_path, fix clang-format.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
261b7a8fd4 mount: setup mnt_bind list before using it in mnt_is_external
Before this patch mnt_is_external() used non-populated mnt_bind list
when called from resolve_shared_mounts(), thus it could work not as
intended.

Let's add separate helper search_bindmounts() for populating mnt_bind
list, and add mnt_bind_is_populated to differentiate between
non-populated list and just empty populated list. This way we can add a
BUG_ON to mnt_is_external to catch such order problems in future.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/e464c1c6d
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/8b22b30d5
Cherry-picked one hunk from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/ca9de41e3

Changes: simplify commit message, merge fixups: search bindmounts
earlier so that we have bindmounts info as early as possible, rename
mnt_no_bind to mnt_bind_is_populated and simplify its logic a bit.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Alexander Mikhalitsyn
30261a7515 mount: skip fstype and source checks for external mounts in mounts_sb_equal
Fstype and source fields can be changed by resolve_external_mounts() or
by try_resolve_ext_mount() for external mounts, but we can have other
mounts from same superblock which are not detected as external, for
instance bind of subdirectory from device-external or bind of
mountpoint-external mount to other mountpoint. So we need to still be
able to find bindmounts between mounts with changed fstype or source and
unchanged mounts.

So let's make mounts_sb_equal ignore the fstype/source checks for
external mounts. Leave only the fstype->sb_equal checks if we have them.

Signed-off-by: Alexander Mikhalitsyn (Virtuozzo) <alexander@mihalicyn.com>

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/fadc38d84
Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/f9700cb12

Changes: merge two commits in one and rework, remove ":)", reword
commit-message to make patch self-sufficient.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
8d5300aa9b mount: mark mounts of external devices external
Previously only autodetected and mountpoint external mounts had
mount_info->external field set, let's fix this injustice so that we can
operate all external mounts in a similar manner.

Also:

Print info message when device external mount is detected similar to
mountpoint external mounts detection.

Add helper mnt_is_nodev_external to let do_mount_one, can_mount_now and
do_bind_mount handle device external mounts separately as it was before.

Handle device external mount right in get_mp_root to set ->external on
restore. (note: calling ext_mount_lookup is only meaningful for
mountpoint external mounts)

Add helper mnt_is_dev_external to use in resolve_source to make it more
clear that it is a device external mount restore path.

All other "if (mi->external)" checks now also handle device external
mounts, but they all look safe to do so and could've done it initially,
here is a list: fusectl_dump, mnt_is_external, dump_one_mountpoint,
propagate_mount.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/afd899539

Changes: cleanup commit message, add some helpers.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
e17c1cc128 mount: do not detect non-fsroot mounts as device-external
Device-external mounts are restored via do_new_mount(), but function
do_new_mount only allows creating mounts with root "/", as it does
simple mount (not bind) without any later root change. Restoring
non-root mounts via do_new_mount is just impossible.

So let's detect mounts as device-external only when they have fsroot
root, all other non-fsroot binds of this device would be restored as
bindmounts of fsroot ones.

This is a cosmetic change: although non-root mounts were detected as
device-external before this patch, they would not be created with
do_new_mount() anyway, because the fsroot/bind check in can_mount_now
orders them to be restored as binds.

Cherry-picked one hunk from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/afd899539

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Alexander Mikhalitsyn
eda1e5fdbd mount: add mntinfo_add_list_before helper for adding to mntinfo list
Use this helper everywhere instead of manually adding mounts to the head
of the list, this way it is much easier to track all places where we do
add to mntinfo list.

Signed-off-by: Alexander Mikhalitsyn (Virtuozzo) <alexander@mihalicyn.com>

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/7bca9397b

Changes: skip hunk adding root_yard_mp to the list because root yard has
not fully initialized mountinfo structure (can break code which uses
mntinfo fallback in lookup_nsid_by_mnt_id), let's only have real mounts
in mntinfo list. Also skip cr_time mount from mntinfo list for the same
reason.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
9649356e3a zdtm: fix mnt_ext_master test to correspond to it's name
Before this change the on-host-"zdtm_auto_ext_mnt" mount with
mountpoint "/tmp/zdtm_ext_auto.XXXXXX" was private/shared depending on
its parent mount "/tmp". And e.g. on my setup the parent mount on
"/tmp" is private and our "host" mount becomes private too. So
in-container-"zdtm_auto_ext_mnt" external mount is also private but test
name hints it should be slave.

E.g. if I run mnt_ext_master before this patch, in the mnt_ext_master
process mntns we see that our "external" mount is private, not slave:

[root@fedora criu]# grep zdtm_auto_ext_mnt /proc/167077/mountinfo
1239 1238 0:138 /test /ext_mounts rw,relatime - tmpfs zdtm_auto_ext_mnt rw,seclabel,inode64

After this patch:

[root@fedora criu]# grep zdtm_auto_ext_mnt /proc/166385/mountinfo
1239 1238 0:138 /test /ext_mounts rw,relatime master:413 - tmpfs zdtm_auto_ext_mnt rw,seclabel,inode64
                                              ^^^^^^^^^^

So we just explicitly make on-host-"zdtm_auto_ext_mnt" shared, and this
makes in-container-"zdtm_auto_ext_mnt" external mount slave.

Cherry-picked from Virtuozzo criu:
https://src.openvz.org/projects/OVZ/repos/criu/commits/a1a221fe9

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
5a8fd343f5 uffd: fix __u64 print format specifier
coverity CID 389197:

CID 389197 (#1 of 1): Invalid printf format string (PRINTF_ARGS)
format_error: Length modifier L not applicable to conversion specifier in %Lu.
284 pr_err("Incompatible uffd API: expected %Lu, got %Lu\n", UFFD_API, uffdio_api.api);

Looking at the C11 standard, it seems that "%Lu" is undefined, so we had
better not use it; see:

"L Specifies that a following a, A, e, E, f, F, g, or G conversion
specifier applies to a long double argument."
http://port70.net/~nsz/c/c11/n1570.html#7.21.6.1p7
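
A minimal sketch of a well-defined way to print a 64-bit unsigned value
is shown below. It assumes a uint64_t argument and the PRIu64 macro from
<inttypes.h>; the helper name format_api_mismatch is hypothetical and
the actual fix in criu may use a different specifier.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format two 64-bit values portably. PRIu64 expands to the correct
 * length modifier and conversion for uint64_t on the target platform,
 * unlike "%Lu", where L is only defined for long double conversions.
 */
int format_api_mismatch(char *buf, size_t len, uint64_t expected, uint64_t got)
{
	return snprintf(buf, len, "expected %" PRIu64 ", got %" PRIu64,
			expected, got);
}

/* Self-check with arbitrary values. */
void format_demo(void)
{
	char buf[64];

	format_api_mismatch(buf, sizeof(buf), 170, 1);
	assert(strcmp(buf, "expected 170, got 1") == 0);
}
```

An alternative with the same effect is "%llu" together with a cast of
the argument to unsigned long long.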

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
Pavel Tikhomirov
9e74735160 sk-unix: fix e_str leak in unix_sk_id_add
coverity CID 389191:

int unix_sk_id_add(unsigned int ino)
2327{
2328        char *e_str;
2329
    1. alloc_fn: Storage is returned from allocation function malloc.
    2. var_assign: Assigning: ___p = storage returned from malloc(20UL).
    3. Condition !___p, taking false branch.
    4. leaked_storage: Variable ___p going out of scope leaks the storage it points to.
    5. var_assign: Assigning: e_str = ({...; ___p;}).
2330        e_str = xmalloc(20);
    6. Condition !e_str, taking false branch.
2331        if (!e_str)
2332                return -1;
    7. noescape: Resource e_str is not freed or pointed-to in snprintf.
2333        snprintf(e_str, 20, "unix[%u]", ino);
    8. noescape: Resource e_str is not freed or pointed-to in add_external.
    CID 389191 (#1 of 1): Resource leak (RESOURCE_LEAK)
    9. leaked_storage: Variable e_str going out of scope leaks the storage it points to.
2334        return add_external(e_str);
2335}

We should free the e_str string after we finish using it in
unix_sk_id_add; the easiest way to do this is to use the cleanup_free
attribute.
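
The idea behind such an attribute can be sketched with the GCC/Clang
cleanup extension. Everything named here (auto_free, free_ptr,
remember_id, unix_id_add_sketch, freed_count) is a hypothetical stand-in
for illustration, not criu's own macro or functions.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int freed_count;  /* instrumentation for the example only */
char last_id[32]; /* where the stand-in consumer keeps its copy */

/* Called automatically with a pointer to the annotated variable. */
static void free_ptr(void *p)
{
	void **pp = p;

	free(*pp);
	freed_count++;
}

/* GCC/Clang extension: run free_ptr when the variable leaves scope. */
#define auto_free __attribute__((cleanup(free_ptr)))

/* Stand-in for a consumer that copies the string it is given. */
static int remember_id(const char *id)
{
	snprintf(last_id, sizeof(last_id), "%s", id);
	return 0;
}

int unix_id_add_sketch(unsigned int ino)
{
	auto_free char *e_str = malloc(20);

	if (!e_str)
		return -1; /* cleanup still runs; free(NULL) is a no-op */
	snprintf(e_str, 20, "unix[%u]", ino);
	return remember_id(e_str); /* e_str is freed right after this */
}
```

The buffer is released on every return path without an explicit free
call, which is what removes the leak reported above.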

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
87d3735145 criu/plugin: Add support for criu image streamer
Modifications to support criu image streamer when using amdgpu_plugin.
When running with criu image streamer, fseek/lseek is not available, so
we store the file size in the first 8 bytes of the actual file.
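
A size-prefixed layout of this kind can be sketched as follows. This is
an assumed illustration, not the plugin's code: write_sized and
read_sized are hypothetical helpers, and the header is written in host
byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Write an 8-byte size header followed by the payload. */
int write_sized(FILE *fp, const void *data, uint64_t size)
{
	if (fwrite(&size, sizeof(size), 1, fp) != 1)
		return -1;
	if (size && fwrite(data, 1, (size_t)size, fp) != (size_t)size)
		return -1;
	return 0;
}

/*
 * Read the header first, then exactly that many payload bytes.
 * Only sequential reads are used, so no seek is needed to learn
 * the size. Returns a malloc'ed buffer or NULL on error.
 */
void *read_sized(FILE *fp, uint64_t *size)
{
	void *buf;

	if (fread(size, sizeof(*size), 1, fp) != 1)
		return NULL;
	buf = malloc(*size ? (size_t)*size : 1);
	if (!buf)
		return NULL;
	if (*size && fread(buf, 1, (size_t)*size, fp) != (size_t)*size) {
		free(buf);
		return NULL;
	}
	return buf;
}
```

The reader never has to ask the stream for its length, which is the
property needed when the data arrives through a pipe-like transport.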

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
55370b720e criu/plugin: Store BO contents directly to file
Store BO contents directly to file (1 per GPU) instead of using
protobuf.

Bug Fix:
Fixes an issue where we could not handle BOs bigger than 4GB because
protobuf has an internal limit of 4GB for the Bytes structure.

Performance Improvements:
This significantly reduces CR duration on multi-GPU systems as it allows
reading and writing to disk in parallel. During checkpoint, instead of
waiting for all the BO contents to be read from the one protobuf file,
we can now start writing the BO contents as soon as the first BO is read
from disk. During restore, we can start writing BO contents to disk
after the first BO from VRAM. This also reduces the peak amount of
system memory used as we only need to keep 1 BO content in memory per
GPU at a time instead of all the BO contents.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
Felix Kuehling
ecdf740fa3 criu/plugin: Add whitepaper document
Adding whitepaper document

Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
Rajneesh Bhardwaj
99a2380fc0 criu/plugin: Dockerfile for amdgpu_plugin
This sets up the pytorch environment for BERT Transformers and also sets
up CRIU along with all its dependencies including amdgpu plugin for
supporting CR with AMDGPUs.

Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
2095de9f03 criu/plugin: Fix for FDs not allowed to mmap
On newer kernels (> 5.13), KFD & DRM drivers will only allow the
/dev/renderD* file descriptors that were used during the CRIU_RESTORE
ioctl when calling mmap for the VMAs.
During restore, after opening /dev/renderD*, amdgpu_plugin keeps the
FDs opened and instead returns a copy of the FDs to CRIU. The same FDs
are then returned during the UPDATE_VMAMAP hooks so that they can be
used by CRIU to call mmap. Duplicated FDs created using dup are
references to the same struct file inside the kernel so they are also
allowed to mmap.
To prevent the opened FDs inside amdgpu_plugin from conflicting with
FDs used by the target restore application, we make sure that the
lowest-numbered FD that amdgpu_plugin will use is greater than the
highest-numbered FD that is used by the target application.
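
One way to obtain a descriptor above a given bound is fcntl with
F_DUPFD, sketched below. The helper name dup_fd_at_least is
hypothetical, and whether the plugin uses this exact call is an
assumption; the sketch only shows the technique.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Duplicate fd onto the lowest free descriptor number that is
 * greater than or equal to min_fd. Like dup(), the new descriptor
 * refers to the same open file description as the original.
 * Returns the new descriptor, or -1 on error.
 */
int dup_fd_at_least(int fd, int min_fd)
{
	return fcntl(fd, F_DUPFD, min_fd);
}

/* Self-check: keep a private copy of /dev/null at number 100 or above. */
void dup_demo(void)
{
	int fd = open("/dev/null", O_RDONLY);
	int hi;

	assert(fd >= 0);
	hi = dup_fd_at_least(fd, 100);
	assert(hi >= 100);
	close(hi);
	close(fd);
}
```

Choosing min_fd above the highest descriptor of the restored task keeps
the helper's descriptors out of the range the task expects to own.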

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
Rajneesh Bhardwaj
bd83330095 criu/plugin: Implement sDMA based buffer access
AMD Radeon GPUs have special sDMA (system DMA engine) IPs that can be
used to speed up read and write operations on VRAM and GTT memory.

Depends on:

* The kernel mode driver (kfd) creating the dmabuf objects for the kfd
  BOs in both checkpoint and restore operation.
* libdrm and libdrm_amdgpu libraries

Suggested-by: Felix Kuehling <felix.kuehling@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
6d79266229 criu/plugin: Restore libhsakmt shared memory files
Libhsakmt(thunk) uses a shared memory file in /dev/shm/hsakmt_shared_mem
and its semaphore in /dev/shm/hsakmt_shared_mem. Adding a check during
checkpoint to see if these two files exist. If they exist then the
plugin will try to restore them during restore.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
a218fe0baa criu/plugin: Read and write BO contents in parallel
Implement multi-threaded code to read and write the contents of each
GPU's VRAM BOs in parallel in order to speed up the dumping process when
using multiple GPUs.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
2022-04-28 17:53:52 -07:00
David Yat Sin
ba9c62df24 criu/plugin: Add unit tests for GPU remapping
Adding unit tests for GPU remapping code when checkpointing and
restoring on different nodes with different topologies.

Signed-off-by: David Yat Sin <david.yatsin@amd.com>
2022-04-28 17:53:52 -07:00