mirror of https://github.com/checkpoint-restore/criu synced 2025-08-22 01:51:51 +00:00

11635 Commits

Author SHA1 Message Date
Adrian Reber
237ac72c32 vdso: switch from DT_HASH to DT_GNU_HASH (aarch64)
Trying to run latest CRIU on CentOS Stream 10 or Ubuntu 24.04 (aarch64)
fails like this:

    # criu/criu check -v4
    [...]
    (00.096460) vdso: Parsing at ffffb2e2a000 ffffb2e2c000
    (00.096539) vdso: PT_LOAD p_vaddr: 0
    (00.096567) vdso: DT_STRTAB: 1d0
    (00.096592) vdso: DT_SYMTAB: 128
    (00.096616) vdso: DT_STRSZ: 8a
    (00.096640) vdso: DT_SYMENT: 18
    (00.096663) Error (criu/pie-util-vdso.c:193): vdso: Not all dynamic entries are present
    (00.096688) Error (criu/vdso.c:627): vdso: Failed to fill self vdso symtable
    (00.096713) Error (criu/kerndat.c:1906): kerndat_vdso_fill_symtable failed when initializing kerndat.
    (00.096812) Found mmap_min_addr 0x10000
    (00.096881) files stat: fs/nr_open 1073741816
    (00.096908) Error (criu/crtools.c:267): Could not initialize kernel features detection.

This seems to be related to the kernel (6.12.0-41.el10.aarch64). The
Ubuntu user-space is running in a container on the same kernel.

Looking at the kernel this seems to be related to:

    commit 48f6430505c0b0498ee9020ce3cf9558b1caaaeb
    Author: Fangrui Song <i@maskray.me>
    Date:   Thu Jul 18 10:34:23 2024 -0700

        arm64/vdso: Remove --hash-style=sysv

        glibc added support for .gnu.hash in 2006 and .hash has been obsoleted
        for more than one decade in many Linux distributions.  Using
        --hash-style=sysv might imply unaddressed issues and confuse readers.

        Just drop the option and rely on the linker default, which is likely
        "both", or "gnu" when the distribution really wants to eliminate sysv
        hash overhead.

        Similar to commit 6b7e26547fad ("x86/vdso: Emit a GNU hash").

The commit basically does:

    -ldflags-y := -shared -soname=linux-vdso.so.1 --hash-style=sysv \
    +ldflags-y := -shared -soname=linux-vdso.so.1 \

Which results in only a GNU hash being added to the ELF header. This
change has been merged with 6.11.

Looking at the referenced x86 commit:

    commit 6b7e26547fad7ace3dcb27a5babd2317fb9d1e12
    Author: Andy Lutomirski <luto@amacapital.net>
    Date:   Thu Aug 6 14:45:45 2015 -0700

        x86/vdso: Emit a GNU hash

        Some dynamic loaders may be slightly faster if a GNU hash is
        available.  Strangely, this seems to have no effect at all on
        the vdso size.

        This is unlikely to have any measurable effect on the time it
        takes to resolve vdso symbols (since there are so few of them).
        In some contexts, it can be a win for a different reason: if
        every DSO has a GNU hash section, then libc can avoid
        calculating SysV hashes at all.  Both musl and glibc appear to
        have this optimization.

        It's plausible that this breaks some ancient glibc version.  If
        so, then, depending on what glibc versions break, we could
        either require COMPAT_VDSO for them or consider reverting.

Which is also a really simple change:

    -VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=sysv) \
    +VDSO_LDFLAGS = -fPIC -shared $(call cc-ldoption, -Wl$(comma)--hash-style=both) \

The big difference here is that for x86 both hash sections are
generated. For aarch64 only the newer GNU hash is generated. That is why
we only see this error on kernel >= 6.11 and aarch64.

Changing from DT_HASH to DT_GNU_HASH seems to work on aarch64.  The test
suite runs without any errors.

Unfortunately, I am not aware of all implications of this change, or
whether a successful test suite run means that it still works.

Looking at the kernel, I see the following hash styles for the VDSO:

aarch64: not specified (only GNU hash style)
arm: --hash-style=sysv
loongarch: --hash-style=sysv
mips: --hash-style=sysv
powerpc: --hash-style=both
riscv: --hash-style=both
s390: --hash-style=both
x86: --hash-style=both

Only aarch64 on kernels >= 6.11 is a problem right now, because all
other platforms provide the old style hashing.

Signed-off-by: Adrian Reber <areber@redhat.com>
Co-developed-by: Dmitry Safonov <dima@arista.com>
Co-authored-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
2025-02-03 13:56:30 -08:00
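For readers unfamiliar with the two hash styles discussed in 237ac72c32: a DT_HASH table is indexed with the SysV ELF hash of the symbol name, while a DT_GNU_HASH table uses a different hash function. The sketch below shows only these two well-known hash functions; the function names are illustrative and this is not CRIU's vdso parsing code.

```c
#include <stdint.h>

/* SysV ELF hash of a symbol name, as used with DT_HASH tables. */
uint32_t example_elf_sysv_hash(const char *name)
{
    uint32_t h = 0, g;

    while (*name) {
        h = (h << 4) + (unsigned char)*name++;
        g = h & 0xf0000000u;
        if (g)
            h ^= g >> 24;
        h &= ~g;
    }
    return h;
}

/* GNU hash of a symbol name (seed 5381, multiplier 33), as used with DT_GNU_HASH tables. */
uint32_t example_elf_gnu_hash(const char *name)
{
    uint32_t h = 5381;

    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h;
}
```

Because the two functions produce unrelated values for the same name, a symbol lookup written for one table layout cannot simply be pointed at the other table.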
Pavel Tikhomirov
1c9fd58ff0 zdtm/netns_sub_sysctl: add ipv4/ping_group_range sysctl check
Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-02-03 12:35:58 +08:00
Pavel Tikhomirov
f38e58836a net/sysctl: c/r ipv4/ping_group_range value
It is per net namespace, we need it to allow creation of unprivileged
ICMP sockets.

Note: in case this sysctl was disabled after unprivileged ICMP
socket was created we still need to somehow handle it on restore.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-02-03 12:35:58 +08:00
Pavel Tikhomirov
7f35e46e9d net/sysctl: put common multiplier outside the brackets
Also add an explanation of the logic behind this calculation.

Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
2025-02-03 12:35:58 +08:00
Adrian Reber
7eaf43368d ci: handle results from latest codespell
CI pulls in a newer version of codespell. This fixes complaints from
that codespell version.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-29 13:18:31 -08:00
Adrian Reber
343e7319b9 lib: do not set protobuf has_* field too early
For two cases libcriu was setting the RPC protobuf field `has_*` before
checking if the given parameter is valid. If the caller doesn't check
the return value, this can lead to situations where we pass an RPC struct
to CRIU which has the `has_*` protobuf field set to true, but does not
have a verified value (or none at all) set for the actual RPC entry.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-28 10:45:22 -08:00
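The ordering issue described in 343e7319b9 can be sketched as follows. The struct and setter are hypothetical stand-ins (not libcriu's API), and the accepted range 0..4 is an assumption chosen only for illustration.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a protobuf-c message with one optional field. */
struct example_opts {
    bool has_log_level;
    int log_level;
};

/* Validate first; only a verified value may flip the has_* flag. */
int example_set_log_level(struct example_opts *opts, int level)
{
    if (opts == NULL || level < 0 || level > 4)
        return -EINVAL; /* has_log_level is left untouched */

    opts->log_level = level;
    opts->has_log_level = true;
    return 0;
}
```

With this ordering, a caller that ignores the return value still ends up with a message whose `has_*` flag matches the stored value.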
Radostin Stoyanov
fc1dbc4915 cuda: disable CUDA plugin for pre-dump
Temporarily disable CUDA plugin for `criu pre-dump`.

pre-dump currently fails with the following error:

Handling VMA with the following smaps entry: 1822c000-18da5000 rw-p 00000000 00:00 0                                  [heap]
Handling VMA with the following smaps entry: 200000000-200200000 ---p 00000000 00:00 0
Handling VMA with the following smaps entry: 200200000-200400000 rw-s 00000000 00:06 895                              /dev/nvidia0
Error (criu/proc_parse.c:116): handle_device_vma plugin failed: No such file or directory
Error (criu/proc_parse.c:632): Can't handle non-regular mapping on 705693's map 200200000
Error (criu/cr-dump.c:1486): Collect mappings (pid: 705693) failed with -1

We plan to enable support for pre-dump by skipping nvidia mappings
in a separate patch.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-01-28 10:41:45 -08:00
Radostin Stoyanov
dcd8808db0 seize: use separate checkpoint_devices function
Move `run_plugins(CHECKPOINT_DEVICES)` out of `collect_pstree()` to
ensure that the function's sole responsibility is to use the cgroup
freezer for the process tree. This allows us to avoid a time-out
error when checkpointing applications with large GPU state.

v2: This patch calls `checkpoint_devices()` only for `criu dump`.
Support for GPU checkpointing with `pre-dump` will be introduced in
a separate patch.

Suggested-by: Andrei Vagin <avagin@google.com>
Suggested-by: Jesus Ramos <jeramos@nvidia.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-01-28 10:41:45 -08:00
Radostin Stoyanov
59b022db35 cuda: prevent task lockup on timeout error
When creating a checkpoint of large models, the `checkpoint` action of
`cuda-checkpoint` can exceed the CRIU timeout. This causes CRIU to fail
with the following error, leaving the CUDA task in a locked state:

	cuda_plugin: Checkpointing CUDA devices on pid 84145 restore_tid 84202
	Error (criu/cr-dump.c:1791): Timeout reached. Try to interrupt: 0
	Error (cuda_plugin.c:139): cuda_plugin: Unable to read output of cuda-checkpoint: Interrupted system call
	Error (cuda_plugin.c:396): cuda_plugin: CHECKPOINT_DEVICES failed with
	net: Unlock network
	cuda_plugin: finished cuda_plugin stage 0 err -1
	cuda_plugin: resuming devices on pid 84145
	cuda_plugin: Restore thread pid 84202 found for real pid 84145
	Unfreezing tasks into 1
		Unseizing 84145 into 1
	Error (criu/cr-dump.c:2111): Dumping FAILED.

To fix this, we set `task_info->checkpointed` before invoking
the `checkpoint` action to ensure that the CUDA task is resumed
even if CRIU times out.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2025-01-28 10:41:45 -08:00
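A minimal sketch of the ordering used in 59b022db35, assuming a simplified task record: the flag is set before the potentially slow action runs, so cleanup code still sees it if the action fails or is interrupted. All names except the `checkpointed` field are hypothetical, and the callback stands in for invoking `cuda-checkpoint`.

```c
#include <stdbool.h>

/* Simplified, hypothetical task record. */
struct example_task {
    int pid;
    bool checkpointed;
};

/* Hypothetical action that behaves like a timed-out checkpoint. */
int example_action_timeout(int pid)
{
    (void)pid;
    return -1;
}

/* Mark the task before running the action, not after it succeeds. */
int example_checkpoint(struct example_task *t, int (*action)(int pid))
{
    t->checkpointed = true;
    return action(t->pid);
}

/* Cleanup path: decides whether the task must be resumed. */
bool example_needs_resume(const struct example_task *t)
{
    return t->checkpointed;
}
```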
Adrian Reber
5513a33300 net: remember the name of the lock chain (nftables)
Using libnftables the chain to lock the network is composed of
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process is running in a namespace the
real PID can be anything and only the PID in the namespace is restored
correctly. Relying on the real PID does not work for the chain name.

Using the PID of the innermost namespace would lead to the chain being
called 'CRIU-1' most of the time, which is also not really unique.

With this commit the chain is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run ID used during checkpointing is now stored in
the inventory as dump_criu_run_id.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-28 10:38:30 -08:00
Adrian Reber
d165b94bb5 criu: use libuuid for criu_run_id generation
criu_run_id will be used in upcoming changes to create and remove
network rules for network locking. Instead of trying to come up with
a way to create unique IDs, just use an existing library.

libuuid should be installed on most systems as it is indirectly required
by systemd (via libmount).

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-28 10:38:30 -08:00
Adrian Reber
b3869c9172 ci: two check-commits.yml changes
* Switch to v4 actions/checkout (from v3)
* Use our apt wrapper to gracefully handle temporary repository errors

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-24 10:30:51 -08:00
Austin Kuo
b7cbd2ca92 test/zdtm: add a new test to check non-periodic timers
It creates a few timers with long expiration intervals, waits for C/R
and checks that timers are armed and their intervals have been restored.

Signed-off-by: Austin Kuo <hsuanchikuo@gmail.com>
2025-01-21 17:34:07 -08:00
Austin Kuo
5eee7a6ee2 timer: Refine itimer_armed logic and improve timer value handling
Right now, CRIU skips non-periodic timers. This change addresses
this issue.

Signed-off-by: Austin Kuo <hsuanchikuo@gmail.com>
2025-01-21 17:33:53 -08:00
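As background for 5eee7a6ee2: with POSIX interval timers, whether a timer is armed depends on `it_value`, while `it_interval` only says whether it reloads after expiring. The helpers below are an illustrative sketch with hypothetical names, not CRIU's `itimer_armed` implementation.

```c
#include <stdbool.h>
#include <sys/time.h>

/* A timer is armed when the time until its next expiration is non-zero. */
bool example_timer_is_armed(const struct itimerval *v)
{
    return v->it_value.tv_sec != 0 || v->it_value.tv_usec != 0;
}

/* A timer is periodic when it has a non-zero reload interval. */
bool example_timer_is_periodic(const struct itimerval *v)
{
    return v->it_interval.tv_sec != 0 || v->it_interval.tv_usec != 0;
}
```

A one-shot timer is therefore armed but not periodic, which is the case that must not be skipped on dump.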
Adrian Reber
637682d8aa test: fix cmdlinenv00 on aarch64
On aarch64 the test cmdlinenv00 was failing with:

  FAIL: cmdlinenv00.c:120: auxv corrupted on restore (errno = 11 (Resource temporarily unavailable))

Starting with Linux kernel version 6.3 the size of AUXV was changed:

    commit 28c8e088427ad30b4260953f3b6f908972b77c2d
    Author: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
    Date:   Wed Jan 4 14:20:54 2023 -0500

        rseq: Increase AT_VECTOR_SIZE_BASE to match rseq auxvec entries

        Two new auxiliary vector entries are introduced for rseq without
        matching increment of the AT_VECTOR_SIZE_BASE, which causes failures
        with CONFIG_HARDENED_USERCOPY=y.

        Fixes: 317c8194e6ae ("rseq: Introduce feature size and alignment ELF auxiliary vector entries")

With this change AT_VECTOR_SIZE increases from 40 to 50 on aarch64. CRIU
uses AT_VECTOR_SIZE to read the content of /proc/PID/auxv

        auxv_t mm_saved_auxv[AT_VECTOR_SIZE];
        ret = read(fd, mm_saved_auxv, sizeof(mm_saved_auxv));

Now the test works again on aarch64.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-21 09:17:06 -08:00
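One way to notice an auxv buffer that is too small, as in the problem described in 637682d8aa, is to check that the data read actually contains the terminating AT_NULL entry (type 0). The sketch below uses a hypothetical two-word entry struct rather than CRIU's `auxv_t`.

```c
#include <stddef.h>

/* Hypothetical auxv entry layout: a type word followed by a value word. */
struct example_auxv {
    unsigned long type;
    unsigned long val;
};

#define EXAMPLE_AT_NULL 0UL /* terminator entry type */

/*
 * Return the number of entries before the AT_NULL terminator, or -1 if
 * the first n entries contain no terminator (the buffer was too small).
 */
long example_count_auxv(const struct example_auxv *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (buf[i].type == EXAMPLE_AT_NULL)
            return (long)i;
    return -1;
}
```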
Adrian Reber
10ffad2188 files-reg: fix buffer overflow on aarch64
Running the zdtm/static/unlink_regular00 test on Ubuntu 24.04 on aarch64
results in the following error:

    # ./zdtm.py run -t zdtm/static/unlink_regular00 -k always
    userns is supported
    === Run 1/1 ================ zdtm/static/unlink_regular00
    ==================== Run zdtm/static/unlink_regular00 in ns ====================
    Skipping rtc at root
    Start test
    Test is SUID
    ./unlink_regular00 --pidfile=unlink_regular00.pid --outfile=unlink_regular00.out --dirname=unlink_regular00.test
    Run criu dump
    *** buffer overflow detected ***: terminated
    ############# Test zdtm/static/unlink_regular00 FAIL at CRIU dump ##############
    Test output: ================================

     <<< ================================
    Send the 9 signal to  47
    Wait for zdtm/static/unlink_regular00(47) to die for 0.100000
    ##################################### FAIL #####################################

According to the backtrace:

    #0  __pthread_kill_implementation (threadid=281473158467616, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
    #1  0x0000ffff93477690 in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
    #2  0x0000ffff9342cb3c in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
    #3  0x0000ffff93417e00 in __GI_abort () at ./stdlib/abort.c:79
    #4  0x0000ffff9346abf0 in __libc_message_impl (fmt=fmt@entry=0xffff93552a78 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:132
    #5  0x0000ffff934e81a8 in __GI___fortify_fail (msg=msg@entry=0xffff93552a28 "buffer overflow detected") at ./debug/fortify_fail.c:24
    #6  0x0000ffff934e79e4 in __GI___chk_fail () at ./debug/chk_fail.c:28
    #7  0x0000ffff934e9070 in ___snprintf_chk (s=s@entry=0xffffc6ed04a3 "testfile", maxlen=maxlen@entry=4056, flag=flag@entry=2, slen=slen@entry=4053,
        format=format@entry=0xaaaacffe3888 "link_remap.%d") at ./debug/snprintf_chk.c:29
    #8  0x0000aaaacff4b8b8 in snprintf (__fmt=0xaaaacffe3888 "link_remap.%d", __n=4056, __s=0xffffc6ed04a3 "testfile")
        at /usr/include/aarch64-linux-gnu/bits/stdio2.h:54
    #9  create_link_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60, lfd=lfd@entry=20,
        idp=idp@entry=0xffffc6ed14ec, nsid=nsid@entry=0xaaaada2bac00, parms=parms@entry=0xffffc6ed2808, fallback=0xaaaacff4c6c0 <dump_linked_remap+96>,
        fallback@entry=0xffffc6ed2797) at criu/files-reg.c:1164
    #10 0x0000aaaacff4c6c0 in dump_linked_remap (path=path@entry=0xffffc6ed2901 "/zdtm/static/unlink_regular00.test/subdir/testfile", len=len@entry=60,
        parms=parms@entry=0xffffc6ed2808, lfd=lfd@entry=20, id=id@entry=12, nsid=nsid@entry=0xaaaada2bac00, fallback=fallback@entry=0xffffc6ed2797)
        at criu/files-reg.c:1198
    #11 0x0000aaaacff4d8b0 in check_path_remap (nsid=0xaaaada2bac00, id=12, lfd=20, parms=0xffffc6ed2808, link=<optimized out>) at criu/files-reg.c:1426
    #12 dump_one_reg_file (lfd=20, id=12, p=0xffffc6ed2808) at criu/files-reg.c:1827
    #13 0x0000aaaacff51078 in dump_one_file (pid=<optimized out>, fd=4, lfd=20, opts=opts@entry=0xaaaada2ba2c0, ctl=ctl@entry=0xaaaada2c4d50,
        e=e@entry=0xffffc6ed39c8, dfds=dfds@entry=0xaaaada2c3d40) at criu/files.c:581
    #14 0x0000aaaacff5176c in dump_task_files_seized (ctl=ctl@entry=0xaaaada2c4d50, item=item@entry=0xaaaada2b8f80, dfds=dfds@entry=0xaaaada2c3d40)
        at criu/files.c:657
    #15 0x0000aaaacff3d3c0 in dump_one_task (parent_ie=0x0, item=0xaaaada2b8f80) at criu/cr-dump.c:1679
    #16 cr_dump_tasks (pid=<optimized out>) at criu/cr-dump.c:2224
    #17 0x0000aaaacff163a0 in main (argc=<optimized out>, argv=0xffffc6ed40e8, envp=<optimized out>) at criu/crtools.c:293

This line is the problem:

    snprintf(tmp + 1, sizeof(link_name) - (size_t)(tmp - link_name - 1), "link_remap.%d", rfe.id);

The problem was that the `-1` was inside the parentheses and not
outside. This way the destination size was increased by 1 instead of
being decreased by 1, which triggered the buffer overflow detection.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-21 09:15:40 -08:00
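The size arithmetic from 10ffad2188 can be reproduced in isolation. The sketch below compares the two expressions and then uses the corrected one; function names are hypothetical and the surrounding logic is simplified compared to `create_link_remap()`.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Bytes available when writing at slash + 1 inside buf. */
size_t example_room_fixed(const char *buf, size_t buf_size, const char *slash)
{
    return buf_size - (size_t)(slash - buf) - 1;
}

/* Misplaced "- 1": the result is 2 bytes larger than the real room. */
size_t example_room_buggy(const char *buf, size_t buf_size, const char *slash)
{
    return buf_size - (size_t)(slash - buf - 1);
}

/* Replace the last path component of buf with "link_remap.<id>". */
int example_write_remap_name(char *buf, size_t buf_size, int id)
{
    char *slash = strrchr(buf, '/');
    size_t room;
    int n;

    if (slash == NULL)
        return -1;
    room = example_room_fixed(buf, buf_size, slash);
    n = snprintf(slash + 1, room, "link_remap.%d", id);
    return (n < 0 || (size_t)n >= room) ? -1 : 0;
}
```

For a 64-byte buffer with the last slash at offset 7, the corrected expression yields 56 and the buggy one 58, which is why a fortified `snprintf()` rejects the call.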
Yuanhong Peng
ca90d8e7eb seize: Adjust the position of the log message
Based on the code, the `ret` variable at this point does not
represent the task state, so this log message should be
moved to a position after the `compel_wait_task()` function.

Signed-off-by: Yuanhong Peng <yummypeng@linux.alibaba.com>
2025-01-19 11:41:24 +00:00
Adrian Reber
27a5b9aa87 net: redirect nftables stdout and stderr to CRIU's log file
When using the nftables network locking backend and restoring a process
a second time the network locking has already been deleted by the first
restore. The second restore will print text like the following to the console:

Error: Could not process rule: No such file or directory
delete table inet CRIU-202621

With this change CRIU's log FD is used by libnftables stdout and stderr.

Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-17 09:10:36 -08:00
Adrian Reber
ea2ddb886a util: added cleanup_file attribute.
Signed-off-by: Adrian Reber <areber@redhat.com>
2025-01-17 09:10:36 -08:00
Liu Chao
d4d3937017 zdtm: Check CapAmb is restored correctly after C/R
This test sets CapAmb according to CapPrm and CapInh and checks CapAmb
after C/R.

Signed-off-by: Liu Chao <liuchao173@huawei.com>
2025-01-09 21:28:17 -08:00
Liu Chao
6991ea1ff9 cr: Task CapAmb support
Signed-off-by: Liu Chao <liuchao173@huawei.com>
2025-01-09 21:28:17 -08:00
Kir Kolyshkin
7c66617d0e freeze_processes: implement kludges for cgroup v1
Cgroup v1 freezer has always been problematic, failing to freeze a
cgroup.

In runc, we have implemented a few kludges to increase the chance of
succeeding, but those are used when runc freezes a cgroup for its own
purposes (for "runc pause" and to modify device properties for cgroup
v1).

When criu is used, it fails to freeze a cgroup from time to time
(see [1], [2]). Let's try adding kludges similar to ones in runc.

Alas, I have absolutely no way to test this, so please review carefully.

[1]: https://github.com/opencontainers/runc/issues/4273
[2]: https://github.com/opencontainers/runc/issues/4457

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-06 20:12:35 -08:00
Kir Kolyshkin
9c3c095cfe freeze_processes: fix logic
There are a few issues with the freeze_processes logic:

1. Commit 9fae23fbe2 grossly (by 1000x) miscalculated the number of
   attempts required. As a result, we are seeing something like this:

> (00.000340) freezing processes: 100000 attempts with 100 ms steps
> (00.000351) freezer.state=THAWED
> (00.000358) freezer.state=FREEZING
> (00.100446) freezer.state=FREEZING
> ...close to 100 lines skipped...
> (09.915110) freezer.state=FREEZING
> (10.000432) Error (criu/cr-dump.c:1467): Timeout reached. Try to interrupt: 0
> (10.000563) freezer.state=FREEZING

   For 10s with 100ms steps we only need 100 attempts, not 100000.

2. When the timeout is hit, the "failed to freeze cgroup" error is not
   printed, and the log_unfrozen_stacks is not called either.

3. The nanosleep at the last iteration is useless (this was hidden by
   issue 1 above, as the timeout was hit first).

Fix all these.

While at it,

4. Amend the error message with the number of attempts, sleep duration,
   and timeout.

5. Modify the "freezing cgroup" debug message to be in sync with the
   above error.

   Was:

   > freezing processes: 100000 attempts with 100 ms steps

   Now:

   > freezing cgroup some/name: 100 x 100ms attempts, timeout: 10s

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-06 20:12:35 -08:00
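The arithmetic and loop shape described in 9c3c095cfe can be sketched as below. This is an illustrative reduction with hypothetical names: the freezer state check and the sleep are passed in as callbacks, `step_ms` is assumed to be non-zero, and the counter exists only to make the behaviour observable.

```c
/* Number of polling attempts that fit into the timeout. */
unsigned long example_freeze_attempts(unsigned int timeout_s, unsigned int step_ms)
{
    return (unsigned long)timeout_s * 1000UL / step_ms;
}

/* Hypothetical helpers used to observe the loop. */
unsigned int example_sleep_calls;

void example_count_sleep(void)
{
    example_sleep_calls++;
}

int example_never_frozen(void)
{
    return 0;
}

/*
 * Poll until is_frozen() reports success. Sleeping after the final
 * attempt would only delay the error, so it is skipped.
 */
int example_wait_frozen(int (*is_frozen)(void), void (*sleep_step)(void),
                        unsigned long attempts)
{
    for (unsigned long i = 0; i < attempts; i++) {
        if (is_frozen())
            return 0;
        if (i + 1 < attempts)
            sleep_step();
    }
    return -1; /* caller reports the "failed to freeze" error */
}
```

A 10 s timeout with 100 ms steps gives 100 attempts, matching the corrected debug message quoted in the commit.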
Kir Kolyshkin
f314ca5e1f criu/seize.c: clang-format it
Done using clang-format 19.1.5 with .clang-format obtained via
scripts/fetch-clang-format.sh.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2025-01-06 20:12:35 -08:00
Andrei Vagin
32d5a766ee test: run scm06 in the ns and uns flavors
The kernel releases a test socket asynchronously, so the restore can
fail if it is executed before the kernel actually destroys the socket.

Fixes #2537

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-12-14 23:22:51 -08:00
Andrei Vagin
d46cbf76ed test/java: increase the ghost file limit
Right now, this test fails with this error:
Error (criu/files-reg.c:1031): Can't dump ghost file
  /criu/test/javaTests/omrvmem_000000626_Mlm48x of 2097152 size,
  increase limit

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-12-12 16:39:58 +00:00
Jesus Ramos
6d1da61482 cuda: Fix return value from CHECKPOINT_DEVICES hook so that dumps fail properly
cuda-checkpoint returns the positive CUDA error code when it runs into an issue
and passing that along as the return value would cause errors to get ignored.

Signed-off-by: Jesus Ramos <jeramos@nvidia.com>
2024-12-10 14:19:05 -08:00
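A sketch of the idea behind 6d1da61482, assuming that callers of the hook treat only negative return values as failures (implied by the commit message, not verified against the code). The function name is hypothetical.

```c
/*
 * Map a tool exit status (0 on success, a positive error code on
 * failure) to the negative-on-error convention assumed for hook callers.
 */
int example_hook_result(int tool_status)
{
    return tool_status == 0 ? 0 : -1;
}
```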
Andrei Vagin
058572e91d vdso: handle vvar_vclock vma-s
The vvar_vclock was introduced by [1]. Basically, the old vvar vma has
been split into two parts. In terms of C/R, these two vma-s can still be
treated as one.

[1] e93d2521b27f ("x86/vdso: Split virtual clock pages into dedicated mapping")

Signed-off-by: Andrei Vagin <avagin@google.com>
2024-12-08 08:58:07 +00:00
Radostin Stoyanov
beff27eca1 pidfd: add missing include
Fix for the following error when building CRIU on Rocky Linux 8

criu/pidfd.c: In function ‘pidfd_open’:
criu/pidfd.c:119:17: error: ‘__NR_pidfd_open’ undeclared (first use in this function); did you mean ‘pidfd_open’?
  return syscall(__NR_pidfd_open, pid, flags);
                 ^~~~~~~~~~~~~~~
                 pidfd_open
criu/pidfd.c:119:17: note: each undeclared identifier is reported only once for each function it appears in
criu/pidfd.c:120:1: error: control reaches end of non-void function [-Werror=return-type]
 }
 ^
criu/pidfd.c: At top level:
cc1: error: unrecognized command line option ‘-Wno-unknown-warning-option’ [-Werror]
cc1: error: unrecognized command line option ‘-Wno-dangling-pointer’ [-Werror]
cc1: all warnings being treated as errors

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-12-04 08:59:40 -08:00
Alexander Mikhalitsyn
1452c76f65 compel/arch/riscv64: properly implement compel_task_size()
We need to dynamically calculate TASK_SIZE depending
on the MMU on a RISC-V system. [We are using an analogous
approach on aarch64/ppc64le.]

This change was tested on physical machine:
StarFive VisionFive 2
isa		: rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb
mmu		: sv39
uarch		: sifive,u74-mc
mvendorid	: 0x489
marchid		: 0x8000000000000007
mimpid		: 0x4210427
hart isa	: rv64imafdc_zicntr_zicsr_zifencei_zihpm_zca_zcd_zba_zbb

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2024-11-21 14:44:52 -08:00
Alexander Mikhalitsyn
7a8ed9e210 compel: fix gitignore and remove autogenerated code
We don't need to have compel/arch/riscv64/plugins/std/syscalls/syscalls.S
tracked in git. It is autogenerated. We also need to update our .gitignore
to ignore autogenerated files with syscall tables.

Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2024-11-21 14:44:52 -08:00
Radostin Stoyanov
dd6b580b43 test: add get-state to mocked cuda-checkpoint tool
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-13 07:07:54 -08:00
Radostin Stoyanov
d6e5e7677f cuda: enable checkpoint support for paused tasks
If a CUDA process is already in a "locked" or "checkpointed" state
during criu dump, the CUDA plugin currently fails with an error because
it attempts an unnecessary "lock" action using the cuda-checkpoint tool.

This patch extends the CUDA plugin to handle such cases by first
verifying the initial state of the CUDA processes and skipping
unnecessary "lock" and "checkpoint" actions when a process has been
locked or checkpointed before CRIU is invoked.

In particular, CUDA tasks may already be in a "locked" or "checkpointed"
state to ensure consistent checkpoint/restore for distributed workloads,
such as model training, where multiple containers run across different
cluster nodes.

Another use case for this functionality is optimizing resource
utilization, where CUDA tasks with low-priority are preempted
immediately to release GPU resources needed by high-priority
tasks, and the paused workloads are later resumed or migrated
to another node.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-13 07:07:54 -08:00
Bhavik Sachdev
223a8f1e86 zdtm: Check many processes with common dead pidfd
We have multiple processes open a pidfd to a common dead process.
After C/R we check whether the inode numbers for these pidfds are
equal.

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2024-11-12 12:28:21 -08:00
Andrei Vagin
6f0ec7def6 pidfd: one process creates a helper and opens all fds to it
Currently, the `waitpid()` call on the tmp process can be made by a
process which is not its parent. This causes restore to fail.

This patch instead selects one process to create the tmp process and
open all the fds that point to it. These fds are sent to the correct
process(es).

Fixes: #2496

Signed-off-by: Andrei Vagin <avagin@gmail.com>
Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
2024-11-12 12:28:21 -08:00
Radostin Stoyanov
26dcc216c2 cuda: fix check for GPU device availability
The check for `/dev/nvidiactl` to determine if the CUDA plugin can be
used is unreliable because in some cases the default path for driver
installation is different [1]. This patch changes the logic to check
if a GPU device is available in `/proc/driver/nvidia/gpus/`. This
approach is similar to `torch.cuda.is_available()` and it is a more
accurate indicator.

The subsequent check for support of the `cuda-checkpoint --action`
option would confirm if the driver supports checkpoint/restore.

[1] https://github.com/NVIDIA/gpu-operator

Fixes: #2509

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-11 19:35:01 -08:00
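The availability check described in 26dcc216c2 amounts to asking whether a directory exists and contains at least one entry. The helper below is a generic sketch with a hypothetical name; in the plugin's case the path of interest would be `/proc/driver/nvidia/gpus/`.

```c
#include <dirent.h>
#include <stdbool.h>
#include <string.h>

/* Return true if path is a readable directory with at least one real entry. */
bool example_dir_has_entries(const char *path)
{
    DIR *d = opendir(path);
    struct dirent *e;
    bool found = false;

    if (d == NULL)
        return false; /* missing or unreadable directory */

    while ((e = readdir(d)) != NULL) {
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0) {
            found = true;
            break;
        }
    }
    closedir(d);
    return found;
}
```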
Radostin Stoyanov
31b38d662d ci: test interrupt-only mode with frozen cgroup
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-11 19:33:31 -08:00
Radostin Stoyanov
f8f0e1df76 seize: enable support for frozen containers
Container runtimes like CRI-O and containerd utilize the freezer cgroup
to create a consistent snapshot of container root filesystem (rootfs)
changes. In this case, the container is frozen before invoking CRIU.
After CRIU successfully completes, a copy of the container rootfs diff
is saved, and the container is then unfrozen.

However, the `cuda-checkpoint` tool is not able to perform a 'lock'
action on frozen threads.  To support GPU checkpointing with these
container runtimes, we need to unfreeze the cgroup and return it to its
original state once the checkpointing is complete.

To reflect this new behavior, the following changes are applied:
 - `dont_use_freeze_cgroup(void)` -> `set_compel_interrupt_only_mode(void)`
 - `bool freeze_cgroup_disabled` -> `bool compel_interrupt_only_mode`
 - `check_freezer_cgroup(void)` -> `prepare_freezer_for_interrupt_only_mode(void)`

Note that when `compel_interrupt_only_mode` is set to `true`,
`compel_interrupt_task()` is used instead of `freeze_processes()`
to prevent tasks from running during `criu dump`.

Fixes: #2508

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-11 19:33:31 -08:00
Radostin Stoyanov
216d804aab seize: fix error handling for check_freezer_cgroup
When `check_freezer_cgroup()` has non-zero return value, `goto err` calls
`return ret`. However, the value of `ret` has been set to `0` in the lines
above and CRIU does not handle the error properly.

This problem is related to https://github.com/checkpoint-restore/criu/issues/2508

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-11-06 08:56:13 -08:00
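The error-handling pattern fixed in 216d804aab can be reduced to the following sketch. Function names are hypothetical; the check callback stands in for `check_freezer_cgroup()`.

```c
/* Hypothetical check that always fails. */
int example_failing_check(void)
{
    return -1;
}

/* Buggy shape: the failure is detected, but ret still holds 0. */
int example_collect_buggy(int (*check)(void))
{
    int ret = -1;

    /* ... earlier steps succeed ... */
    ret = 0;

    if (check())
        goto err;

    /* ... remaining work ... */
err:
    return ret;
}

/* Fixed shape: store the check's result before jumping to the error path. */
int example_collect_fixed(int (*check)(void))
{
    int ret;

    ret = check();
    if (ret)
        goto err;

    /* ... remaining work ... */
err:
    return ret;
}
```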
Lorenzo Fontana
dcc3b49619 criu: Initialize util before service worker starts
When restoring dumps in new mount + pid namespaces where multiple dumps
share the same network namespace, CRIU may fail due to conflicting
unix socket names. This happens because the service worker creates
sockets using a pattern that includes criu_run_id, but util_init()
is called after cr_service_work() starts.

The socket naming pattern "crtools-fd-%d-%d" uses the restore PID
and criu_run_id, however criu_run_id is always 0 when not initialized,
leading to conflicts when multiple restores run simultaneously either
in the same CRIU process or because of multiple CRIU processes
doing the same operation in different PID namespaces.

Fix this by:

- Moving util_init() before cr_service_work() starts
- Adding a second util_init() call in the service worker fork
  to ensure unique IDs across multiple worker runs
- Making sure that dump and restore operations have util_init() called
  early to generate unique socket names

With this fix, socket names always include the namespace ID, preventing
conflicts when multiple processes with the same pid share a network
namespace.

Fixes #2499

[ avagin: minor code changes ]

Signed-off-by: Lorenzo Fontana <fontanalorenz@gmail.com>
Signed-off-by: Andrei Vagin <avagin@google.com>
2024-10-31 00:00:14 -07:00
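The collision described in dcc3b49619 follows directly from the naming pattern. The sketch below formats names with the quoted "crtools-fd-%d-%d" pattern; the helper names are hypothetical and the run ID is modelled as a plain integer.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Format a socket name from a PID and a run ID; -1 if it does not fit. */
int example_socket_name(char *buf, size_t size, int pid, int run_id)
{
    int n = snprintf(buf, size, "crtools-fd-%d-%d", pid, run_id);

    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Return 1 if two (pid, run_id) pairs map to the same socket name. */
int example_names_collide(int pid_a, int id_a, int pid_b, int id_b)
{
    char a[64], b[64];

    if (example_socket_name(a, sizeof(a), pid_a, id_a) ||
        example_socket_name(b, sizeof(b), pid_b, id_b))
        return 0;
    return strcmp(a, b) == 0;
}
```

Two restores that both see PID 1 in their own PID namespace and both use an uninitialized run ID of 0 produce the same name, while distinct run IDs keep the names apart.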
Liu Hua
f5dec056ad uffd: Disable image deduplication after fork
After a fork, both the child and parent processes may trigger a page fault (#PF)
at the same virtual address, referencing the same position in the page image.
If deduplication is enabled, the last process to trigger the page fault will fail.

Therefore, deduplication should be disabled after a fork to prevent this issue.

Signed-off-by: Liu Hua <weldonliu@tencent.com>
2024-10-26 22:18:22 -07:00
Cryolitia PukNgae
f6baf8143b include: don't use GCC's __builtin_ffs on riscv64
Link: e300da4db4

Signed-off-by: PukNgae Cryolitia <Cryolitia@gmail.com>
---
- cherry-picked
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2024-10-26 22:18:22 -07:00
Haorong Lu
986376929e ci: add workflow for riscv64
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
2024-10-26 22:18:22 -07:00
Haorong Lu
663678222c zdtm: add riscv64 support
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
2024-10-26 22:18:22 -07:00
Haorong Lu
35b30774fc criu: add riscv64 support to parasite and restorer
Co-authored-by: Yixue Zhao <felicitia2010@gmail.com>
Co-authored-by: stove <stove@rivosinc.com>
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
2024-10-26 22:18:22 -07:00
Haorong Lu
1a42f63d30 images: add riscv64 core image
Co-authored-by: Yixue Zhao <felicitia2010@gmail.com>
Co-authored-by: stove <stove@rivosinc.com>
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
2024-10-26 22:18:22 -07:00
Haorong Lu
7fd95a509d compel: add riscv64 support
Co-authored-by: Yixue Zhao <felicitia2010@gmail.com>
Co-authored-by: stove <stove@rivosinc.com>
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
---
- rebased
- added a membarrier() to syscall table (fix authored by Cryolitia PukNgae)
Signed-off-by: PukNgae Cryolitia <Cryolitia@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2024-10-26 22:18:22 -07:00
Haorong Lu
0d2d23b6d0 include: add common header files for riscv64
Co-authored-by: Yixue Zhao <felicitia2010@gmail.com>
Co-authored-by: stove <stove@rivosinc.com>
Signed-off-by: Haorong Lu <ancientmodern4@gmail.com>
---
- rebased
- imported a page_size() type fix (authored by Cryolitia PukNgae)
Signed-off-by: PukNgae Cryolitia <Cryolitia@gmail.com>
Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com>
2024-10-26 22:18:22 -07:00
Bhavik Sachdev
d8be857b4b pidfd: block SIGCHLD during tmp process creation
This patch blocks SIGCHLD during temporary process creation to prevent a
race condition between kill() and waitpid() where sigchld_handler()
causes `criu restore` to fail with an error.

Fixes: #2490

Signed-off-by: Bhavik Sachdev <b.sachdev1904@gmail.com>
Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-10-26 22:18:22 -07:00
Radostin Stoyanov
e6ce8f4054 zdtm: add inventory test plugins
This patch adds two test plugins to verify that CRIU plugins listed
in the inventory image are enabled, while those that are not listed
can be disabled.

Signed-off-by: Radostin Stoyanov <rstoyanov@fedoraproject.org>
2024-10-26 22:18:22 -07:00