Lol, the author's thought process mirrored mine as I read the article. As I was reading I was thinking, 'doesn't kqueue support that?'... and then came a section on kqueue. Then I was thinking to myself, 'so how does the Linux implementation do it then?'... I was just about to start trawling the source code when I hit 'A parenthesis...'
Great article. Sorry to say though, Windows does manage all this in a more consistent way - but I guess they had the benefit of a clean slate.
signalfd / process descriptors are the Windows-style mechanism... what is missing is a few things, like a 'spawn' that returns an fd directly (eliminating races...)
Child 1 exec()s the command.
Child 2 does this: sleep for the timeout, then exit.
Start both children, then call wait(), which blocks until any child exits and returns the pid of the child that exited. If it's the command child, then your command finished. If it's the other child, then the timeout expired.
Now that one child has exited, kill() the other child with SIGTERM and reap it by calling wait() again.
All of this assumes you'll only have these two children going, but if you're writing a small exponential backoff command retry utility, that should be OK.
The pid will not be reused until you either handle SIGCHLD or wait.
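A minimal sketch of that two-child scheme in C ('sleep 30' stands in for the real command, with a hard-coded 5-second timeout; error handling elided):

#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t cmd = fork();
    if (cmd == 0)
    {
        // Child 1: the command ("sleep 30" is a stand-in).
        execlp("sleep", "sleep", "30", (char *)NULL);
        _exit(127);
    }
    pid_t timer = fork();
    if (timer == 0)
    {
        // Child 2: the timeout.
        sleep(5);
        _exit(0);
    }
    pid_t first = wait(NULL);                   // whichever exits first
    kill(first == cmd ? timer : cmd, SIGTERM);  // stop the other child...
    wait(NULL);                                 // ...and reap it
    return first == cmd ? EXIT_SUCCESS : EXIT_FAILURE;
}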
In the early days of Android I had an app that had to do video transcoding, yet it often hit OOM on startup (reported via telemetry) even when the phone should have had enough memory. This was before Android had any video transcoding built in (2.3 days).
The solution was to spawn a child process, use memory in a loop, catch the SIGKILL of that child in the parent, yield to the OS as it killed other processes to free memory on the device as a whole, and then, on return from sleep in the parent process after the child was killed, start the video transcoding.
Hopefully this hack is no longer needed, but if you want Android to proactively run its process-killing job so your app starts with maximum free memory, the above worked!
From what I've heard, this is also how most of the "memory cleaner" apps work on most platforms - use memory in a loop so the system starts dropping various caches and housekeeping tasks and swapping out background processes, then exit so the memory is reclaimed.
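A minimal sketch of that "balloon" trick in C (the 16 MiB step and the one-second settle are arbitrary assumptions of mine; on Android the actual killing is done by the low-memory killer):

#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

// Fork a child that keeps allocating and touching memory until the OS
// kills it, then continue in the parent once it is gone.
static void balloon_until_killed(void)
{
    pid_t pid = fork();
    if (pid == 0)
    {
        for (;;)
        {
            char *p = malloc(16 * 1024 * 1024);
            if (!p)
            {
                _exit(0);   // allocation failed before the OS stepped in
            }
            memset(p, 1, 16 * 1024 * 1024);   // touch the pages for real
        }
    }
    waitpid(pid, NULL, 0);   // returns once the child has been killed
    sleep(1);                // let the system settle before proceeding
}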
An interesting aspect of waitid is that it allows you to access the full exit code of the process (i.e., the entire int instead of just the bottom 8 bits).
Unfortunately, many operating systems implement waitid() on top of one of the older APIs, meaning the top bits get lost regardless…
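For reference, the waitid() shape that exposes this, with the caveat above in mind:

#include <sys/wait.h>

// Reap `pid` and return its exit status as reported by waitid().
static int full_exit_status(pid_t pid)
{
    siginfo_t info;
    if (waitid(P_PID, pid, &info, WEXITED) == 0 && info.si_code == CLD_EXITED)
    {
        // si_status is declared int; on systems where waitid() is not
        // layered over the older wait4()/waitpid(), it can carry more
        // than the low 8 bits of the value passed to exit().
        return info.si_status;
    }
    return -1;
}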
I wrote a crate https://crates.io/crates/swaperooni for similar use cases some time ago. I only gave the article a cursory scan, and can clearly see much deeper thought was given here. Can't wait to dig in after work and learn a little bit.
Dunking on my crate is welcomed :)
Edit: by threads I mean creating a new thread to wait for the process, and then killing the process after a certain timeout if it hasn't terminated. I guess I'm spoiled by Go...
1. Start a thread.
2. That thread starts a child process and signals "started" by storing its PID somewhere globally-visible (and hopefully atomic/lock-protected).
3. The thread then blocks in wait(2), taking advantage of its non-main-thread-ness to avoid some signals and optionally masking/ignoring some more.
4. When the process exits, the thread can write the exit status/"completed" to the globally-visible state next to the PID. The thread then exits.
5. External observers wait for the process with a timeout by attempting to join the thread with a timeout. If the timeout occurs, they can access the globally-visible PID and send a signal to it (a rough sketch follows this list).
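In C with POSIX threads the pattern might look roughly like this (the struct and the "sleep 10" stand-in command are hypothetical, and real code needs real synchronization around the shared state):

#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

struct child_watch
{
    pid_t pid;    // step 2: published once the child is started
    int status;   // step 4: valid after the watcher thread exits
};

static void *watch_child(void *arg)
{
    struct child_watch *w = arg;
    pid_t pid = fork();
    if (pid == 0)
    {
        // "sleep 10" is a stand-in for the real command.
        execlp("sleep", "sleep", "10", (char *)NULL);
        _exit(127);
    }
    w->pid = pid;                  // step 2 (should be atomic/locked)
    waitpid(pid, &w->status, 0);   // step 3: block in wait
    return NULL;                   // step 4: joiners observe completion
}

An observer then implements the step-5 timeout with, e.g., glibc's pthread_timedjoin_np(3), sending kill(w->pid, SIGTERM) if the join times out.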
This is missing from the article (EDIT: it has since been added, thanks!). That doesn't mean it's a good solution on many platforms. It's more costly in resources (thread stack), more code than most of the listed options, vulnerable to PID-reuse problems that can cause a kill signal to go to the wrong process, likely plays poorly with spawning methods that request a SIGCHLD be sent to the parent on exit (and plays poorly with signals in general if any customization is needed there), and is probably often slower than most of TFA's alternatives as well, both due to syscall count and pessimal thread/scheduler switching conditions. Additionally, it multiplexes/composes to large numbers of processes poorly and with a high resource cost.
EDIT: Golang's version of this is less bad than described above, but not perfect. Go's spawning infrastructure mitigates resource cost (goroutines/segmented stacks are not as heavy as threads), is vulnerable to PID-reuse (as are most platforms' operations in this area), addresses the SIGCHLD risk through the runtime and signal channels, and mitigates slowness with a very good scheduler. For multiplexing, I would assume (but I have not verified) that the Go runtime is internally using pidfds/kqueue where supported. Where not supported, I would assume Go is internally tracking spawn requests through its stdlib, handling SIGCHLD, and has a single global routine calling wait(2) without a specific PID, waking goroutines waiting on a watched PID when it comes out of the call to wait(2).
Thanks. I believe that Go indeed _could_ use those APIs to wait for the child more efficiently if they chose to, but the current implementation suggests that they're just calling wait4() in a separate thread: https://cs.opensource.google/go/go/+/refs/tags/go1.23.3:src/...
To be fair, in Go process spawning is very inefficient to begin with, since it requires lots of runtime coordination to not mess with the threads/goroutines state during fork, so running wait4() in a separate thread (although the thread can be re-used afterwards) is not the biggest concern here.
Thanks for this great article, it is going to be very useful for my project. I am currently developing an open source Android native app that invokes rsync when a file gets closed (i.e., you take a picture):
https://github.com/aguaviva/Syncy
> I would prefer extending poll to support things other than file descriptors, instead of converting everything to a file descriptor to be able to use poll.
Why? The ability to block on these descriptors as a one-off rather than wrapping them into a poll makes them extremely useful and avoids the race issues that exist with signal handlers and other non-blocking mechanisms.
signalfd, timerfd, eventfd, userfaultfd, pidfd are all great applications of this strategy.
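For example, a timerfd can serve as a one-off blocking sleep with a plain read(), no poll or epoll required. A minimal sketch (error handling elided):

#include <stdint.h>
#include <sys/timerfd.h>
#include <unistd.h>

int main(void)
{
    // One-off blocking use: read() blocks until the timer expires.
    int tfd = timerfd_create(CLOCK_MONOTONIC, 0);
    struct itimerspec ts = { .it_value = { .tv_sec = 1 } };
    uint64_t expirations;

    timerfd_settime(tfd, 0, &ts, NULL);
    read(tfd, &expirations, sizeof expirations);   // blocks ~1 second
    close(tfd);
    return 0;
}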
That is K&R C syntax, supported up to C18. The solution to tools emitting unwanted diagnostics is not to appease them with pointless cruft but to shut off the diagnostic.
So this would be a way which predates C23's maybe_unused attribute¹
Nice trick
[1] https://en.cppreference.com/w/c/language/attributes/maybe_un...
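By way of illustration (the function here is my own hypothetical example), the C23 attribute versus the shut-off-the-diagnostic route:

// C23: annotate the parameter instead of relying on pre-C23 tricks.
// Before C23, one can pass -Wno-unused-parameter (GCC/Clang) or use a
// GCC-style pragma to turn the warning off instead.
int handler(int value, [[maybe_unused]] void *context)
{
    return value + 1;
}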
> Because the Linux kernel coalesces SIGCHLD (and other signals), the only way to reliably determine if a monitored process has exited, is to loop through all PIDs registered by any kqueue when we receive a SIGCHLD. This involves many calls to waitid(2) and may have a negative performance impact.
This is somewhat wrong. To speed things up in the happy case (where we are the only part of the program that is spawning children), you can just do a `WNOHANG` wait for any child first, and check if it's one of the children we care about. Only if it's an unknown child do you have to do the full loop (of course, if you only have a couple of children the loop may be better).
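A sketch of that happy path; is_watched(), handle_exit(), and full_pid_scan() are hypothetical stand-ins for the kqueue bookkeeping:

#include <sys/types.h>
#include <sys/wait.h>

// Hypothetical helpers standing in for the kqueue bookkeeping.
int is_watched(pid_t pid);
void handle_exit(pid_t pid, int status);
void full_pid_scan(void);

// On SIGCHLD: reap whatever is ready; only fall back to the expensive
// scan of every registered PID when we reap a child we don't recognize.
static void on_sigchld(void)
{
    int status;
    pid_t pid;

    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
    {
        if (is_watched(pid))
            handle_exit(pid, status);
        else
            full_pid_scan();
    }
}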
0 is a valid fd, so I recommend initializing fds to -1.
signalfd was just mentioned off-hand, but for writing anything larger, let's say a daemon process, it keeps things close to all the other events being reacted to. E.g.:
#include <signal.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <sys/timerfd.h>
#include <sys/signalfd.h>
#include <sys/epoll.h>

static int signalfd_init(void)
{
    sigset_t sigs, oldsigs;
    int sfd = -1;

    sigemptyset(&sigs);
    sigemptyset(&oldsigs);
    sigaddset(&sigs, SIGCHLD);
    // Block SIGCHLD so it is delivered through the signalfd rather than
    // a signal handler.
    if (!sigprocmask(SIG_BLOCK, &sigs, &oldsigs))
    {
        sfd = signalfd(-1, &sigs, SFD_CLOEXEC | SFD_NONBLOCK);
        if (sfd != -1)
        {
            // Success
            return sfd;
        }
        perror("signalfd");
        sigprocmask(SIG_SETMASK, &oldsigs, NULL);
    }
    else
    {
        perror("sigprocmask");
    }
    return -1;
}

static int timerfd_init(void)
{
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK | TFD_CLOEXEC);
    if (tfd != -1)
    {
        // One-shot timer: fires once, five seconds from now.
        struct itimerspec tv =
        {
            .it_value = { .tv_sec = 5 }
        };
        if (!timerfd_settime(tfd, 0, &tv, NULL))
        {
            return tfd;
        }
        perror("timerfd_settime");
        close(tfd);
    }
    else
    {
        perror("timerfd_create");
    }
    return -1;
}

static int epoll_init(int sfd, int tfd)
{
    int efd;

    // 0 is a valid descriptor; -1 marks failure.
    if (sfd == -1 || tfd == -1)
    {
        return -1;
    }
    efd = epoll_create1(EPOLL_CLOEXEC);
    if (efd != -1)
    {
        struct epoll_event ev[2] =
        {
            { .events = EPOLLIN, .data = { .fd = sfd } },
            { .events = EPOLLIN, .data = { .fd = tfd } }
        };
        if (!epoll_ctl(efd, EPOLL_CTL_ADD, sfd, &ev[0]) &&
            !epoll_ctl(efd, EPOLL_CTL_ADD, tfd, &ev[1]))
        {
            return efd;
        }
        perror("epoll_ctl");
        close(efd);
    }
    else
    {
        perror("epoll_create1");
    }
    return -1;
}

int main(int argc, char *argv[])
{
    int exit_value = EXIT_FAILURE;
    int sfd, tfd, efd;

    if (argc < 2)
    {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        exit(exit_value);
    }
    sfd = signalfd_init();
    tfd = timerfd_init();
    efd = epoll_init(sfd, tfd);
    if (sfd != -1 && tfd != -1 && efd != -1)
    {
        pid_t child_pid = fork();
        if (child_pid != -1)
        {
            if (!child_pid)
            {
                // Child: run the command from our command line. (Note:
                // SIGCHLD is still blocked here; a more careful version
                // would restore the old mask before exec.)
                argv += 1;
                execvp(argv[0], argv);
                perror("execvp");
                exit(EXIT_FAILURE);
            }
            else
            {
                int err;
                struct epoll_event ev;
                while ((err = epoll_wait(efd, &ev, 1, -1)) > 0)
                {
                    if (ev.data.fd == sfd)
                    {
                        // Drain the signalfd; SIGCHLD here means the
                        // child exited before the timeout.
                        struct signalfd_siginfo si;
                        if (read(sfd, &si, sizeof si) == sizeof si &&
                            si.ssi_signo == SIGCHLD)
                        {
                            waitpid(child_pid, NULL, 0);   // reap it
                            exit_value = EXIT_SUCCESS;
                            break;
                        }
                    }
                    else if (ev.data.fd == tfd)
                    {
                        // Timer fired first: kill the child, then reap it.
                        kill(child_pid, SIGTERM);
                        waitpid(child_pid, NULL, 0);
                        break;
                    }
                }
                if (err == -1)
                {
                    perror("epoll_wait");
                }
            }
        }
        else
        {
            perror("fork");
        }
    }
    close(sfd);
    close(tfd);
    close(efd);
    exit(exit_value);
}
https://youtu.be/l6XQUciI-Sc?t=3643
I know this is related but maybe someone smarter than me can explain how closely it relates (or doesn't) to this issue, which seems more general (IIRC Cantrill was talking about fs events, not child processes generally).
The lineage of tools descending from daemontools for service management is worth exploring:
daemontools: http://cr.yp.to/daemontools.html
runit: https://smarden.org/runit/
s6: https://skarnet.org/software/s6/
dinit: https://davmac.org/projects/dinit/
https://www.man7.org/linux/man-pages/man3/io_uring_prep_wait...
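Assuming the (truncated) link above is io_uring_prep_waitid(3) from liburing (>= 2.5), a hedged sketch of its use:

#include <liburing.h>
#include <sys/wait.h>

// Queue a waitid on the ring and (here, synchronously) reap the result.
static int wait_on_ring(struct io_uring *ring, pid_t pid, siginfo_t *info)
{
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    struct io_uring_cqe *cqe;
    int ret;

    io_uring_prep_waitid(sqe, P_PID, pid, info, WEXITED, 0);
    io_uring_submit(ring);
    ret = io_uring_wait_cqe(ring, &cqe);   // or multiplex with other I/O
    if (ret == 0)
    {
        ret = cqe->res;                    // 0 on success, -errno on failure
        io_uring_cqe_seen(ring, cqe);
    }
    return ret;
}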