KVM
KVM Architecture
Details are given below
QEMU
Startup
main(): vl.c
|---runstate_init(): vl.c
| Initialize the transition table
|---init_clocks(): qemu-timer.c
| {rt|vm|host}_clock = qemu_new_clock(QEMU_CLOCK_{REALTIME|VIRTUAL|HOST})
|---rtc_clock = host_clock
|---os_setup_early_signal_handling(): os-posix.c
| Ignore SIG_PIPE
|---module_call_init(MODULE_INIT_MACHINE)
|---machine = find_default_machine(): vl.c
|---Parse options: memory, cdrom, enable-kvm, etc.
|---loc_set_none(): qemu-error.c
| Set the current location to "nowhere"
|---cpudef_init(): arch_init.c
| x86_cpudef_setup(): target-i386/cpu.c
| Model, etc.
|---data_dir = os_find_datadir(): os-posix.c
| Find the location of data files: bios, etc.
|---socket_init(): qemu-sockets.c
|---os_daemonize(): os-posix.c
|---configure_accelarator(): vl.c
| accelarator? = requested accelarator \cap available accelarators
| accelarator?.init()
| kvm_init(): kvm-all.c
| KVMState *s = new KVMState()
| s->slots[i].slot = i, \forall i, # number the 32 slots
| s->fd = open("/dev/kvm")
| ioctl(s->fd, KVM_GET_API_VERSION), see if we support the version
| ioctl(s->fd, KVM_CAP_MAX_VCPUS), get the number of max. # of cpus supported
| s->vmfd = ioctl(s->fd, KVM_CREATE_VM)
| s->coalesced_mmio = ioctl(s->fd, KVM_CAP_COALESCED_MMIO)
| ... more ioctls on s->fd to get capabilities: vcpu_events, singlestep, debugregs, xsave, xcrs, pit_state2, direct_msi
| kvm_arch_init(): target-i386/kvm.c
| kvm_get_supported_msrs()
| Setup BIOS for guest
| kvm_irqchip_create(): kvm-all.c
| kvm_init_irq_routing()
| gsi_count = kvm_check_extension(s, KVM_CAP_IRQ_ROUTING) = 1024
| kvm_arch_init_irq_routing(): target-i386/kvm.c
| memory_listener_register(): memory.c
| s->many_ioeventfds = kvm_check_many_ioeventfds()
| cpu_interrupt_handler = kvm_handle_interrupt
| Register the interrupt handler, kvm_handle_interrupt(): kvm-all.c
|---qemu_init_cpu_loop(): cpus.c
| qemu_init_sigbus()
| sigaction(SIGBUS, sigbus_handler, NULL)
| prctl(PR_MCE_KILL, PR_MCE_KILL_SET, PR_MCE_KILL_EARLY, 0, 0)
| Set the machine check memory corruption kill policy for the current
| thread so that it receives a SIGBUS signal as soon as hardware memory
| corruption is detected inside its address space.
| qemu_cond_init(qemu_{cpu|pause|work|io_proceeded|global}_cond
| qemu_thread_get_self(&io_thread): qemu-thread-posix.c
| thread->thread = pthread_self()
|---qemu_init_main_loop(): vl.c
| main_loop_init(): main-loop.c
| Lock io thread (which should be the same thread, from above)
| qemu_signal_init()
| Block the main thread from receiving: IPI, IO, ALRM, and BUS
| The signal fd gets triggered only for: IO, ALRM, and BUS
| qemu_event_init()
| Create two eventfds
| Set handler for fds[0] to qemu_event_read
| io_thread_fd = fds[1]
|---os_set_line_buffering(): os-posix.c
| Set line buffering on stdout
|---init_timer_alarm(): qemu-timer.c
| dynticks_start_timer(t)
| sigaction(SIGALRM, host_alarm_handler)
| t->timer = timer_create(CLOCK_REALTIME, SIGALRM, <main-thread>)
|---net_init_clients()
|---cpu_exec_init_all(): exec.c
| |---memory_map_init()
| | |---system_memory = g_malloc(sizeof(*system_memory))
| | | returns the size of struct MemoryRegion (which is 168 B)
| | |---memory_region_init(MemoryRegion *mr=system_memory, name="system", size=INT64_MAX): memory.c
| | |---set_system_memory_map(system_memory)
| | | update topology
| | |---system_io = g_malloc(sizeof(*system_io))
| | |---memory_region_init(system_io, "io", 65536)
| | |---set_system_io_map(system_io)
| | +---memory_listener_register({core|io}_memory_listener, system_{memory|io})
| +---io_map_init(): exec.c
| +-- memory_region_init_io(io_mem_{ram|rom|unassigned|notdirty|subpage|watch}, corresponding ops, names UINT64_MAX)
| memory_region_init(mr, name, size) # set generic parameters
| Change other parameters
|---bdrv_init_with_whitelist(): block.c
| bdrv_init()
|---blk_mig_init(): block-migration.c
| register_savevm_live(NULL, "block", 0, 1, &savevm_block_handlers, &block_mig_state)
|---qdev_machine_init(): qdev-monitor.c
| |---qdev_get_peripheral_anon()
| +---qdev_get_peripheral()
|---machine->init(ram_size, boot_devices, kernel_filename, kernel_cmdline, initrd_filename, cpu_model)
| +---pc_init_pci(): hw/pc_piix.c
| +---pc_init1(get_system_memory(), get_system_io(), ram_size, boot_device, kernel_filename, kernel_cmdline, initrd_filename, cpu_model, 1, 1): pc_piix.c
| |---pc_cpus_init(cpu_model): hw/pc.c
| | +---For each smp-cpu: pc_new_cpu(cpu_model)
| | |---cpu = cpu_x86_init(cpu_model): target-i386/helper.c
| | | |---cpu_x86_register(cpu, cpu_model)
| | | +---x86_cpu_realize(OBJECT(cpu), NULL): target-i386/cpu.c
| | | |---Register a reset callback for the cpu
| | | |---mce_init(cpu)
| | | | Machine Check Exception (MCE) is raised by the CPU wherever we are into the VMM or the VM and whichever the current instruction is
| | | +---qemu_init_vcpu(&cpu->env): cpus.c
| | | qemu_kvm_start_vcpu(env): cpus.c
| | | qemu_cond_init(env->halt_cond);
| | | qemu_thread_create(cpu->thread, qemu_kvm_cpu_thread_fn, env,
| | | QEMU_THREAD_JOINABLE);
| | |---cpu->env->apic_state = apic_init(env, env->cpuid_apic_id): hw/pc.c
| | | dev = qdev_create(NULL, "kvm-apic")
| | +-- cpu_reset(cpu)
| | x86_cpu_reset(): target-i386/cpu.c
| | tlb_flush()
| | update cr0, {idt|gdt|ldt|tr}_limit, eip, segments, etc. (# TODO)
| |---kvmclock_create(): clock.c
| | sysbus_create_simple("kvmclock", -1, NULL): sysbus.h
| |---memory_region_init(pci_memory, "pci", INT64_MAX)
| |---pc_memory_init(system_memory, kernel_filename, kernel_cmdline, initrd_filename, below_4g_mem_size, above_4g_mem_size, pci_enabled? rom_memory, ram_memory): hw/pc.c
| | memory_region_init_ram(ram, "pc.ram", below_4g_mem_size + above_4g_mem_size) # allocate ram as a single block; use pointers to subregions
| | vmstate_register_ram_global(ram): savevm.c # register this region
| | pc_system_firmware_init(rom_memory): hw/pc_sysfw.c
| |---kvm_piix3_setup_irq_routing(): hw/pc_piix.c
| | Add irq routes
| |---qemu_allocate_irqs() # allocate IRQState structs
| |---pci_bus = i440fx_init(): hw/piix_pci.c # initialize the host-to-pci bridge and the isabus
| |---kvm_i8259_init(): kvm/i8259.c # initialize the master and slave 8259 PICs on the ISA bus
| |---ioapic_init(gsi_state)
| |---ps_vga_init()
| |---pc_basic_device_init(isa, gsi, rtc, floppy): hw/pc.c
| |---pci_nic_init_nofail(nd, "e1000", NULL) # initialize the network device
| |---pci_piix3_ide_init(pci_bus, hd, ...) # initialize the hdd on IDE
| |---audio_init(isa_bus)
| |---pc_cmos_init(): hw/pc.c
| |---initialize the smbus
| +---pc_pci_device_init(pci_bus)
|---cpu_synchronize_all_post_init(): cpus.c
| For each cpu: cpu_synchronize_post_init(): kvm.h
| +---kvm_cpu_synchronize_post_init(env): kvm-all.c
| |---kvm_arch_put_registers(env, KVM_PUT_FULL_STATE)
| | Put all regs (xsave, xcrs, sregs, msrs), inject mce, mpstate, apic, vcpu events, debugregs
| +---env->kvm_vcpu_dirty = 0
|---set_numa_modes(): # recall there were two NUMA modes
|---net_check_clients(): net.c # sanity check for options mismatch, and initialization
|---ds = get_displaystate()
|---os_setup_signal_handling(): os-posix.c
| Set up termsig_hanlder to handle the following signals: INT, HUP, TERM
|---vnc_display_init(): ui/vnc.c
| vnc_start_worker_thread()
| register_displaychangelistener()
|---dpy_resize(ds)
|---qdev_machine_creation_done(): qdev.c
| Initial machine setup is finished; set flag to allow only hotplug
|---rom_load_all(): hw/loader.c
|---qemu_register_reset()
| Register the reset all function for the main_system_bus
|---qemu_run_machine_init_done_notifiers(): vl.c
| Setup boot sequence for the bios
|---qemu_system_reset(VMRESET_SILENT)
| cpu_synchronize_all_post_reset(): cpus.c
| For each cpu: cpu_synchronize_post_reset(): kvm.h
| kvm_cpu_synchronize_post_reset(env)
| kvm_arch_put_registers(env, KVM_PUT_RESET_STATE);
| env->kvm_vcpu_dirty = 0
|---Check for incoming traffic, otherwise
|---vm_start(): vl.c
| cpu_enable_ticks()
| runstate_set(RUN_STATE_RUNNING)
| vm_state_notify(1, RUN_STATE_RUNNING)
| resume_all_vcpus()
| qemu_cpu_kick(env): cpus.c
| qemu_cpu_kick_thread(env): cpus.c
| pthread_kill(cpu->thread->thread, SIG_IPI)
|---os_setup_post()
|---resume_all_vcpus()
+---main_loop()
Run
Savevm/Loadvm
Savevm
do_savevm(): savevm.c
vm_stop(RUN_STATE_SAVE_VM)
do_vm_stop(RUN_STATE_RESTORE_VM): cpus.c
cpu_disable_ticks()
Disable timers
pause_all_vcpus()
runstate_set()
Change the state machine
vm_state_notify()
Flush all block devices
qemu_savevm_state(): savevm.c
qemu_savevm_state_blocked():
Check if there is a device that doesn't support migration
qemu_savevm_state_begin(): savevm.c
Call the savevm_handlers for each device
Add a section to the snapshot file
qemu_savevm_state_iterate()
cpu_synchronize_all_states(): cpus.c
cpu_synchronize_state(): kvm.h
kvm_cpu_synchronize_state()
run_on_cpu(env, do_kvm_cpu_synchronize_state, env)
kvm_arch_get_registers(): kvm.c
vm_start()
cpu_enable_ticks()
runstate_set(RUN_STATE_RUNNING)
vm_state_notify(1, RUN_STATE_RUNNING)
resume_all_vcpus()
Loadvm
do_loadvm(): monitor.c
vm_stop(RUN_STATE_RESTORE_VM): cpus.c
load_vmstate(): savevm.c
bdrv_drain_all()
Flush existing buffers
qemu_system_reset(VMRESET_SILENT): vl.c
cpu_synchronize_all_post_reset(): cpus.c
qemu_loadvm_state(): savevm.c
For each saved section in the VM snapshot file:
vmstate_load()
Call loadvm_handler()
cpu_synchronize_all_post_init()
vm_start()
Data structures
struct kvm_memory_region {
__u33 slot;
__u32 flags;
__u64 guest_phys_addr;
__u64 memory_size; /* bytes */
}
struct kvm_userspace_memory_region {
__u32 slot; // Corresponds to KVMSlot.slot
__u32 flags; // Corresponds to KVMSlot.flags
__u64 guest_phys_addr; // Corresponds to KVMSlot.start_addr
__u64 memory_size; // Corresponds to KVMSlot.memory_size
__u64 userspace_addr; // Corresponds to KVMSlot.ram
}
struct KVMSlot {
target_phys_addr_t start_addr;
ram_addr_t memory_size;
void *ram;
int slot;
int flags;
}
struct KVMState {
KVMSlot slots[32];
...
};
/* for KVM_GET_DIRTY_LOG */
struct kvm_dirty_log {
__u32 slot;
__u32 padding1;
union {
void *dirty_bitmap; /* one bit per page */
__u64 padding2;
};
};
Migration
- Each device registers save/load callbacks; set up the destination.
- Transfer entire memory from source to destination
- Ask the kernel to start dirty page logging for all the slots
- Ask the kernel for dirty page bitmap
- Transfer the dirty pages; repeat until convergence
- Stop the source VM
- Transfer the device states
References
- Qemu Documentation
- Qemu Memory API
- KVM Live Migration (pdf)
- Discover the Linux KVM
- Updates to the KVM API?
- The Myth of Type I and Type II Hypervisors
- KVM MMU Virtualization (pdf)
- KVM w/ Intel VT (pdf)
Footnotes
[^fn-vm86] vm86 mode: Virtual 8086 mode (also called virtual real mode, V86-mode or VM86) allows the execution of “real mode” applications (eg.: DOS, BIOS) that are incapable of running directly in protected mode while the processor is running a protected mode operating system.
[^fn-e820] E820: e820 is shorthand to refer to the facility by which the BIOS of x86-based computer systems reports the memory map to the operating system or boot loader. It is accessed via the int 15h call, by setting the AX register to value E820 in hexadecimal. It reports which memory address ranges are usable and which are reserved for use by the BIOS.
[^fn-ept] EPT: Extended Page Table. Intel second generation x86 virtualization technology for the memory management unit (MMU).
[^fn-pit] PIT: Programmable Interval Timer. 8253/8254
[^fn-hpet] HPET: High Precision Event Timer. Specifications (pdf)
[^fn-mce] MCE: Machine Check Exception.
[^fn-pic] PIC: Programmable Interrupt Controller. 8259
[^fn-iopic] I/O APIC: Advanced Programmable Interrupt Controller.
Why is checkpointing faster than savevm/loadvm?
- Data-structs are being stored as is, and, hence, can be restored directly, without the overhead of copying them from different places within the address space.