Review: Dune – Safe User-Level Access To Privileged CPU Features

Dune is a system from Stanford that takes Intel’s VT-x technology and, instead of using it to run full-blown VMs, exposes that capability to an untrusted userspace program via a library. The result is a virtualization abstraction at the process level rather than the machine level, which lets userspace carry out optimizations such as direct page table manipulation.

First, some background on Intel VT-x. Before hardware support for virtualization existed, hypervisors relied on binary translation: a binary translator rewrote any privileged instructions that the guest tried to execute. The x86 platform was difficult to virtualize because of the many privileged instructions it provides for paging and segmentation. Once virtualization came of age, hardware vendors like Intel and AMD came up with their own, very similar, designs to assist hypervisor developers. Intel’s VT-x and AMD’s AMD-V let guests run in an unprivileged mode but with complete access to a shadow copy of the privileged registers for their domain. The hypervisor no longer needs to trap and translate these special instructions, which reduces the number of VM exits required.
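
As a quick aside, on Linux these extensions show up as CPU feature flags: `vmx` for VT-x and `svm` for AMD-V. The sketch below parses text in the format of `/proc/cpuinfo` and reports which one is advertised. The function name `virtualization_support` is made up for this post, and the flag can be absent inside a VM or when virtualization is disabled in firmware, so treat a negative result as a hint rather than proof.

```python
def virtualization_support(cpuinfo_text):
    """Return "VT-x", "AMD-V", or None based on x86 CPU feature flags.

    Illustrative helper (not part of Dune): expects text in the
    format of Linux's /proc/cpuinfo.
    """
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only the "flags" lines list CPU features on x86.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    if "vmx" in flags:
        return "VT-x"
    if "svm" in flags:
        return "AMD-V"
    return None
```

On a Linux machine, passing in the contents of `/proc/cpuinfo` (for example `open("/proc/cpuinfo").read()`) should be enough to try it out.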

Intel’s VT-x differentiates between VMX root mode and VMX non-root mode. VMX root mode is essentially the hypervisor mode, in which the host kernel runs. VMX non-root mode is designed for guest operating systems. Both modes have their own privilege levels (rings 0 to 3). In VMX non-root mode, the hardware maintains all the registers and state changes that privileged instructions executing in the guest can cause. This means the hypervisor does not have to trap most guest operations, which reduces the number of VM exits. Guests can perform privileged operations such as modifying page tables and loading the IDT, GDT and LDT. Exceptions that occur in VMX non-root mode are also handled in hardware and delivered directly to the guest, which reduces latency and overhead considerably.

Dune basically uses VT-x to run user-level applications instead of guest kernels. The libDune library acts as a mediator that runs in VMX non-root ring 0 and abstracts away many of the POSIX system calls that applications may depend on. Remember that a system call from ring 3 in VMX non-root mode traps into ring 0 within the same non-root domain. Some system calls must go out to the host kernel, and for these libDune uses the VMCALL instruction (a hypercall) to call into the hypervisor’s kernel. Trapping out to the hypervisor is a fairly expensive operation, since it requires saving and restoring the complete VMCS (VM control structure), which can be fairly large. Dune is able to cut down much of the state that needs to be saved, since it is aware that the container is running an application rather than a full-blown VM. Dune also restricts access to certain MSRs and registers to avoid the performance penalty of saving and restoring them. Dune processes can also batch up page table manipulations and TLB invalidations, since they have complete access to them.
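
To make the two system call paths concrete, here is a toy model in Python. It is only a sketch of the control flow described above: the class `GuestRing0`, the set `HANDLED_IN_GUEST` and the call name `sandbox_policy_check` are all invented for illustration and are not libDune’s API. The one thing it tries to show is that a call answered inside the non-root domain causes no VM exit, while a call forwarded via VMCALL does.

```python
from dataclasses import dataclass, field

# Hypothetical: calls the ring 0 layer can answer without leaving the guest.
HANDLED_IN_GUEST = {"sandbox_policy_check"}


@dataclass
class GuestRing0:
    """Stands in for the library layer running in VMX non-root ring 0."""

    vm_exits: int = 0
    log: list = field(default_factory=list)

    def vmcall(self, name):
        # Models a VMCALL: leaves non-root mode and enters the host kernel.
        self.vm_exits += 1
        self.log.append(("host", name))
        return f"host:{name}"

    def on_syscall(self, name):
        # A system call issued from ring 3 lands here without a VM exit.
        if name in HANDLED_IN_GUEST:
            self.log.append(("guest", name))
            return f"guest:{name}"
        # Everything else has to be forwarded to the host kernel.
        return self.vmcall(name)
```

In this model, `on_syscall("getpid")` increments `vm_exits` while `on_syscall("sandbox_policy_check")` does not, which is the distinction that matters for the overhead discussed next.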

The Dune framework is interesting from a research perspective, although there seem to be a number of challenges in deploying it in a real-world system. For one, system call overhead for an application is fairly high. getpid() is considered one of the simplest system calls a process can make; on Linux it takes ~138 cycles, whereas under Dune the same call took ~850 cycles. Applications might also need to be rewritten to benefit from Dune, and there might be portability problems when moving an application over from a non-Dune environment. On the flip side, some applications might benefit immensely from running in the Dune environment: for example, applications that need direct access to networking devices using VT-d, or that want to implement better garbage collection using PTE information.
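
To put the getpid() numbers in perspective, the arithmetic can be sketched as follows. The two constants are the figures quoted above; the break-even function is my own simplification and assumes the extra system call cycles are the only cost Dune adds, which ignores everything else (page faults, for instance).

```python
LINUX_GETPID_CYCLES = 138  # figure quoted in this review
DUNE_GETPID_CYCLES = 850   # figure quoted in this review


def syscall_penalty(native=LINUX_GETPID_CYCLES, dune=DUNE_GETPID_CYCLES):
    """Return (slowdown factor, extra cycles per system call)."""
    return dune / native, dune - native


def break_even_savings(syscalls, privileged_ops,
                       native=LINUX_GETPID_CYCLES, dune=DUNE_GETPID_CYCLES):
    """Cycles each privileged operation must save for Dune to break even.

    Simplified model (an assumption, not from the paper): the only
    cost of Dune is the extra cycles paid on every system call.
    """
    extra_per_call = dune - native
    return syscalls * extra_per_call / privileged_ops
```

With these figures a system call is roughly 6.2x slower (712 extra cycles). Under the simplified model, a workload making 1,000 system calls for every 100 privileged operations would need each of those operations to save about 7,120 cycles before Dune comes out ahead.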

