[spam][crazy] Diagnosing an Explosive Failure
# build vmem-wip of mimalloc
git clone https://github.com/xloem/mimalloc
mkdir mimalloc/build
cd mimalloc/build
cmake ..
make -j
cd ../..

# use it to build pytorch
git clone https://github.com/pytorch/pytorch
cd pytorch
LD_PRELOAD=../../mimalloc/build/libmimalloc.so VMEM_PREFIX=vmem python3 setup.py build

I've tried the above two or three times on a Raspberry Pi running 64-bit Ubuntu. When I do, a SIGKILL is sent immediately to all running processes. My PTYs close. GDM restarts its session. Quite surprising that a software bug run by a non-root user would cause that.

I'm still presently pretty inhibited around root-process and live-kernel debugging, so my approach would be to narrow down the build process and the code executed using manual bisection of source lines.

The likely culprit is the addition of libseccomp to the code, as a stop-gap to prevent the use of fork() corrupting memory:

+  // disable fork and clone, which would write to the same vmem file
+  void * fctx = seccomp_init(SCMP_ACT_ALLOW);
+  if (fctx) {
+    seccomp_rule_add(fctx, SCMP_ACT_ERRNO(ENOMEM), SCMP_SYS(fork), 0);
+    seccomp_rule_add(fctx, SCMP_ACT_ERRNO(ENOMEM), SCMP_SYS(clone), 0);
+    seccomp_load(fctx);
+    seccomp_release(fctx);
+  }

I'm not sure whether disabling clone() is actually needed as well; I just saw somewhere that they were similar calls.

The explosive error doesn't happen on the small test case I wrote, only on the large build.

I may not debug this immediately as I have other things going on today, but it seemed pleasant I guess to add it to the list. It's not often you encounter a software bug that resets your entire system interface.
left the branch checkout out of the steps at top:

On 1/5/22, k <gmkarl@gmail.com> wrote:
> # build vmem-wip of mimalloc
> git clone https://github.com/xloem/mimalloc
> mkdir mimalloc/build
> cd mimalloc/build

git checkout vmem-wip # should go here

> cmake ..
> make -j
> cd ../..
>
> # use it to build pytorch
> git clone https://github.com/pytorch/pytorch
> cd pytorch
> LD_PRELOAD=../../mimalloc/build/libmimalloc.so VMEM_PREFIX=vmem python3 setup.py build
and on the second step, i have two ../ where there should be 1

On 1/5/22, k <gmkarl@gmail.com> wrote:
> left the branch checkout out of the steps at top:
>
> On 1/5/22, k <gmkarl@gmail.com> wrote:
>> # build vmem-wip of mimalloc
>> git clone https://github.com/xloem/mimalloc
>> mkdir mimalloc/build
>> cd mimalloc/build
>
> git checkout vmem-wip # should go here
>
>> cmake ..
>> make -j
>> cd ../..
>>
>> # use it to build pytorch
>> git clone https://github.com/pytorch/pytorch
>> cd pytorch
>> LD_PRELOAD=../../mimalloc/build/libmimalloc.so VMEM_PREFIX=vmem python3 setup.py build

should be more like (note only 1 '../'):

export LD_PRELOAD=../mimalloc/build/libmimalloc.so
VMEM_PREFIX=vmem python3 setup.py build
The LD_PRELOAD path should have "$(pwd)" at the start of it, since the build process changes the working directory and a relative path would stop resolving.
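A minimal sketch of that fix, assuming the layout from the earlier steps (mimalloc and pytorch cloned side by side, with the shell inside pytorch). The helper `abs_path` and the variable `MIMALLOC_LIB` are made-up names for illustration; the only point is to resolve the path once, before anything changes directory:

```shell
# abs_path is a hypothetical helper, not part of mimalloc or pytorch.
# It prints an absolute form of its argument, resolved against the
# directory the shell is in right now.
abs_path() {
    case "$1" in
        /*) printf '%s\n' "$1" ;;              # already absolute
        *)  printf '%s/%s\n' "$(pwd)" "$1" ;;  # anchor to the current directory
    esac
}

# Resolve once, up front, so child processes that chdir still find the library.
MIMALLOC_LIB="$(abs_path ../mimalloc/build/libmimalloc.so)"

# Printed rather than executed here, since running it is what kills the session.
echo "LD_PRELOAD=$MIMALLOC_LIB VMEM_PREFIX=vmem python3 setup.py build"
```

The resulting path still contains a '..' component, which the loader resolves like any other path; it just no longer depends on the current directory.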
I tried spinning up a docker container to test this in. I pulled ubuntu:impish and ran a new container with a tty, but it is unusably slow: it takes many minutes to process each command I type before showing the next shell prompt. It would be faster to test without docker, rebooting after every test.
My example doesn't reproduce it, but this does:

git clone --depth=1 https://github.com/pytorch/pytorch
git clone --branch vmem-wip https://github.com/xloem/mimalloc
cd mimalloc
cmake .
make
LD_PRELOAD=$(pwd)/libmimalloc.so VMEM_PREFIX=vmem cmake ../pytorch

cmake on the pytorch folder sends out the SIGKILLs. just a vanilla 'cmake' does not.
I commented out all the lines of pytorch's CMakeLists.txt except for the first line, which specifies the cmake version or somesuch. cmake ran without severe issue, complaining that the local CMakeCache.txt did not match the CMakeLists.txt file. I deleted the CMakeCache.txt file and reran the command, and all my processes were killed again.
I removed the version line and all processes were still killed. Maybe this occurrence relates to there not being an existing CMakeCache.txt file, since it worked on mimalloc's cmake.
i rebooted an extra time because chrome stopped responding to my raspberry pi touchscreen

this example is sufficient to cause the systemwide crash:

git clone --branch vmem-wip https://github.com/xloem/mimalloc
mkdir mimalloc/build
cd mimalloc/build
cmake ..
make
rm CMakeCache.txt
touch CMakeLists.txt
LD_PRELOAD=$(pwd)/libmimalloc.so VMEM_PREFIX=vmem cmake .

pytorch need not be involved

the next step is to get cmake sources and see if a cmake built from source triggers the issue

then maybe i can add the seccomp calls to cmake directly, and narrow it down to an example that could be in a bug report
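Since every run of the reproduction takes the terminal down with it, one way to keep some evidence across the crash might be to log to a file and flush it to disk around the command. This is only a sketch: `run_logged` and the log file name are hypothetical, and it assumes the log file on disk survives the session being killed.

```shell
# run_logged is a hypothetical wrapper: it appends a command's output to a
# log file and calls sync around it so the log reaches the disk.
run_logged() {
    log="$1"
    shift
    printf 'start: %s\n' "$*" >> "$log"
    sync                        # flush the start marker before the risky part
    "$@" >> "$log" 2>&1
    rc=$?
    printf 'exit status: %s\n' "$rc" >> "$log"
    sync
    return "$rc"
}

# Harmless demonstration; the real command would go in place of 'echo hello'.
run_logged repro.log echo hello
```

For the reproduction above, the call would look something like `run_logged repro.log env LD_PRELOAD="$(pwd)/libmimalloc.so" VMEM_PREFIX=vmem cmake .`, where `env` keeps the preload confined to cmake rather than the wrapper itself. A log that ends at the start marker with no exit status line would suggest the shell was killed mid-command.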
this message was meant to go here, under the spam tag

On Wed, Jan 5, 2022, 8:31 AM k <gmkarl@gmail.com> wrote:
> I checked my system for updates, and there was a kernel update, but i/o started hanging while apt was updating boot, so I left it there to see if it sorts out
- the system rebooted and finished configuring the boot packages
- the sigkills still occur with the updated kernel and a cmake built from source in debug mode

i wonder if it relates to the user. maybe everything running under the user is killed due to some security thing? dunno