I love this. Please never stop doing what you’re doing.
edit: Of course you’re the top contributor to IncludeOS. That was the first project I thought of while reading this blog post. I’ve been obsessed with the idea of Network Function Virtualization for a long time. It’s the most natural boundary for separating units of work in a distributed system and produces such clean abstractions and efficient scaling mechanisms.
(I’m also a very happy user of Varnish in production btw. It’s by far the most reliable part of the stack, even more than nginx. Usually I forget it’s even there. It’s never been the cause of a bug, once I got it configured properly.)
What I like most is the ability to instantly reset the state of the VM to a known predefined state. It's like restarting the VM without any actual restart. It looks like an ideal course of action for network-facing services that are constantly under attack: even if an attack succeeds, the result is erased on the next request.
Easy COW page sharing for programs that are not written with that in mind, like ML model runners, is also pretty nice.
It also sounds ideal for resuming memory intensive per-user programs, like LLMs with a large context window. You can basically have an executable (and its memory) attached to a user session, but only pay the cost for it while the user session has an open request.
> TinyKVM can fork itself into copies that use copy-on-write to allow for huge workloads like LLMs to share most memory. As an example, 6GB weights required only 260MB working memory per instance, making it highly scalable.
Not entirely what this is intended for, but does anyone have experience running an X server (or Wayland, I don't care)?
I'm doing some dev (on Mac) against RDP server and occasionally have other needs like that for a client. Currently I use UTM (nice QEMU Mac frontend) along with a DietPi (super stripped-down Debian) VM for these sorts of things.
I'm pretty familiar with Docker, but have a good idea of what sorts of hoop-jumping might be needed to get a graphics server to run there. Wondering if there's a simpler path.
This is really exciting. The 2.5us snapshot restore performance is on a par with Wasmtime, but with the huge advantage of being able to run native code, albeit with the disadvantage of much slower (though still microsecond-scale) interop.
I see there is a QuickJS demo in the tinykvm_examples repo already, but it'd be great to see if it's possible to get a JIT-capable JavaScript runtime working, as that would be an order of magnitude faster. In my experiments with server-rendering a React app, native QuickJS took about 12-20ms while V8 took 2-4ms after JIT warmup.
I need to study this some more, but I'd love to get to the point where there is a single Deno-like executable that runs inside the sandbox and makes all HTTP requests through Varnish itself. A snapshot would be taken after importing the specified JS URL, and then each request would run in an isolated snapshot.
Probably needs a mechanism to reset the random seed per request.
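A minimal sketch of that per-request reseeding idea (names are illustrative, not from TinyKVM — the point is just that a restored snapshot should never replay the same random stream):

```cpp
#include <cstdint>
#include <random>

// Illustrative sketch: build a fresh PRNG for each request from host
// entropy, so that resetting the VM to its snapshot does not cause
// every request to see an identical random sequence.
inline std::mt19937_64 make_request_rng() {
    std::random_device rd;                       // fresh entropy each call
    std::seed_seq seq{rd(), rd(), rd(), rd()};   // mix several words
    return std::mt19937_64(seq);
}
```

In a snapshot-per-request design, the host would call something like this right after restoring the snapshot and before handing control to the guest.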
You can run V8 jitless, if you want. It's going to be much faster than QuickJS. Adding JIT support means adding a fixed executable range, which you can also do already, but you can't run it in the dumb CLI example. JITs love to be W+X. So, I'm not sure yet if it's an afternoon amount of work, due to the security implications.
I have experience with this from libriscv, where I also embed JIT run-times like v8 and LuaJIT already.
Fascinating but I'm having trouble understanding the big picture. This runs a user process in a VM with no kernel? Does every system call become a VM exit and get proxied to the host? Or are there no system calls?
It’s a bit more than running a program under seccomp strict mode, but conceptually similar, so running anything too complicated likely won't work. You certainly won’t be able to sandbox Chromium for taking website snapshots, for example.
There are many ways to go about it, but essentially yes: brk, mmap and a few others, just enough to get into main() for some common run-times.
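That host-side dispatch could look roughly like this (an illustrative sketch, not TinyKVM's actual code — syscall numbers are x86-64, and everything outside the small allow-list is simply rejected):

```cpp
#include <cstdint>

// x86-64 Linux syscall numbers for the two calls we emulate.
enum : uint64_t { SYS_MMAP = 9, SYS_BRK = 12 };

// Minimal per-guest state: a grow-only program break and a bump
// pointer for anonymous mappings (addresses are arbitrary examples).
struct Guest {
    uint64_t brk_end   = 0x400000;
    uint64_t mmap_base = 0x7000'0000;
};

// Emulate only what a static guest needs to reach main().
// Returns the syscall result, or -1 for anything unhandled.
int64_t handle_syscall(Guest& g, uint64_t nr, uint64_t arg0) {
    switch (nr) {
    case SYS_BRK:
        if (arg0 > g.brk_end) g.brk_end = arg0;     // grow-only break
        return (int64_t)g.brk_end;
    case SYS_MMAP: {                                 // bump-allocate pages
        uint64_t addr = g.mmap_base;
        g.mmap_base += (arg0 + 0xFFF) & ~0xFFFULL;   // round up to 4 KiB
        return (int64_t)addr;
    }
    default:
        return -1;                                   // everything else denied
    }
}
```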
But you can do whatever you want. For example in libriscv I override the global allocator in my guest programs to use a host-managed heap. That way heap usage has native performance in interpreter mode, while also allowing me full control of the heap from the outside. I wrote about this here: https://medium.com/@fwsgonzo/using-c-as-a-scripting-language...
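The allocator-override trick can be sketched like this (assumed details, not libriscv's actual implementation — here the "host-managed heap" is just a fixed arena with a bump allocator):

```cpp
#include <cstddef>
#include <new>

// A fixed region standing in for host-managed guest memory; because the
// host owns this buffer, it can inspect or reset the guest heap at will.
static std::byte g_arena[1 << 20];   // 1 MiB arena
static std::size_t g_used = 0;

// Route every guest allocation through the arena.
void* operator new(std::size_t n) {
    std::size_t aligned = (n + 15) & ~std::size_t{15};  // 16-byte align
    if (g_used + aligned > sizeof(g_arena)) throw std::bad_alloc{};
    void* p = g_arena + g_used;
    g_used += aligned;
    return p;
}
// Bump allocators don't free individual blocks; reset g_used to reclaim all.
void operator delete(void*) noexcept {}
void operator delete(void*, std::size_t) noexcept {}
```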
For the Varnish integration I added permission-based access to local files. Network stuff can be accessed through custom APIs. A simple fetch(url, options)-like system call. Just have a look at the VMOD repository. It's something I'd like to move into TinyKVM when I feel like it.
What do you mean by I/O exactly? Because to me handling HTTP requests definitely requires I/O, no matter how you technically implement it. Does the program start anew with new arguments for each HTTP request, and if so how is that an improvement over I/O syscalls?
I mean you don't get to open files, sockets, devices, etc. in the sandboxed program. You get to do just a few minimal things like I/O on stdin/stdout/stderr, use shared memory, maybe allocate memory.
Yep. I could imagine a deterministic method of just sending the executable plus the changed pages: load the program the same way on the other machine, then apply the changed pages. It would be a minimal transfer. Thread state can also be migrated, but Linux-kernel state like FDs cannot. Or at least, that's not my area of expertise!
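The changed-pages idea can be sketched as a simple page diff against the pristine image (illustrative only; assumes memory sizes are multiples of the page size):

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

constexpr std::size_t PAGE = 4096;

// One dirty page: its index and a copy of its contents.
struct DirtyPage { std::size_t index; std::vector<unsigned char> data; };

// Compare live guest memory against the pristine (deterministically
// loaded) image and collect only the pages that differ.
std::vector<DirtyPage> diff_pages(const std::vector<unsigned char>& pristine,
                                  const std::vector<unsigned char>& live) {
    std::vector<DirtyPage> out;
    for (std::size_t i = 0; i * PAGE < live.size(); ++i) {
        const unsigned char* a = pristine.data() + i * PAGE;
        const unsigned char* b = live.data() + i * PAGE;
        if (std::memcmp(a, b, PAGE) != 0)
            out.push_back({i, std::vector<unsigned char>(b, b + PAGE)});
    }
    return out;
}

// On the receiving machine: start from the same pristine image and
// overlay the transferred dirty pages.
void apply_pages(std::vector<unsigned char>& mem,
                 const std::vector<DirtyPage>& pages) {
    for (const auto& p : pages)
        std::memcpy(mem.data() + p.index * PAGE, p.data.data(), PAGE);
}
```

So the transfer is just the executable identity plus a handful of pages, rather than the whole address space.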
There was Condor for this[1], a couple of decades ago. Condor would checkpoint the process and restart it on another machine entirely at user level (though it required processes to link against their library) by continuing to forward system calls. It of course had plenty of limitations, and some of its decisions would be considered serious security risks now (e.g. it intercepts open() and records the name, and assumes that it's safe to reopen a file by the same name after migration), but it was an interesting system.
I think migrating cooperating processes would be fairly simple, and the big challenge is rather to decide on the right set of tradeoffs.
I don’t see why not; over ten years ago the OpenVZ VM code had a way to rsync a container across the network: syncing everything, then only the pages that had changed since the start of the sync, then the final pages that had changed in the last few seconds. There was a tiny delay to pause the container on the old host and start it on the new one, but I am sure this could be reduced further.
It's rather something that sits between WebAssembly and containers, combining the sandboxing guarantees of the former with the performance of the latter. From a security perspective, the composition is also really good (WebAssembly enforces memory limits, but doesn't have memory protection, NULL pointers are writable, etc. and this is solved here). But unlike WebAssembly, it is Linux-only. So, not something that can run in Web browsers.
I mean. It's built on KVM and integrates deeply with how processes work; I'm not sure it's possible to make it portable without a lot of engineering time, performance hit, or both.
It's worked for "run this Docker image to use this code" sorts of things on Windows for me. That's all I use it for, and it's an inconvenience. Docker, that is. Not Docker on Windows: Docker in general.
The VM may (and should) be limited to a small subset of what's available on the host, though.
> The TinyKVM guest has a tiny kernel which cannot be modified.
Some notes from the post:
> I found that TinyKVM ran at 99.7% native speed
> As long as they are static and don’t need file or network access, they might just run out-of-the box.
> The TinyKVM guest has a tiny kernel which cannot be modified
I’m exploring micro-VMs for my self-hosted PaaS, https://lunni.dev/ – and something with such little overhead seems like a really interesting option!
[1] https://chtc.cs.wisc.edu/doc/ckpt97.pdf
Agreed. That's a good way to sum it up.
With a read-only operating system that is identical across machines (e.g. NixOS or Silverblue), you would only have to send the dirty pages, too!
Would I use this to run distributed infra on a server, a bit like docker-compose? Or is it not related?
It runs on bare metal, though. I just thought it was very interesting to see. Must have been a lot of work.
> hypervisor
> no VMs
Um?
https://hub.docker.com/r/microsoft/windows
My understanding is that it... doesn't work all that well.