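According to the documentation in the kvmtool GitHub repository, kvmtool, like QEMU, is a host user-space program that hosts KVM guest operating systems. As a pure full-virtualization tool, it runs guest operating systems unmodified. However, because KVM builds on the CPU's hardware virtualization support, it only supports guests of the same architecture as the host.
kvmtool is a clean, written-from-scratch, lightweight virtualization tool with only about 5 KLOC of code, which makes it very approachable for anyone who wants to learn virtualization. Implemented as a KVM host tool, it can boot Linux guest images without a BIOS or other related dependencies. Below we set up a kvmtool environment on Ubuntu 22 and run another Linux system in a virtual machine.
Host environment
The host system used in this experiment is Ubuntu 22.04; see the figure below for details:
Downloading the code
Download kvmtool: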
$ git clone https://github.com/kvmtool/kvmtool.git
Download busybox:
$ wget https://busybox.net/downloads/busybox-1.32.0.tar.bz2
Download the Linux kernel:
$ axel -a -n 80 https://www.kernel.org/pub/linux/kernel/v5.x/linux-5.15.18.tar.gz
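When choosing versions, it is enough to pick tool and source releases from roughly the same period; the exact versions do not matter much.
Building kvmtool
The kvmtool version used in this experiment is commit e17d182ad3f797f01947fc234d95c96c050c534b. Building it is straightforward: change into the kvmtool directory and run make:
The resulting executable is lkvm. The build also creates vm, a hard link to lkvm, so the two are identical.
Building the Linux kernel
Building the kernel is straightforward; see this blog post: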
https://blog.csdn.net/tugouxp/article/details/117616804?spm=1001.2014.3001.5502
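Three points need attention here:
Fix the build errors caused by missing .pem files; there are two of them.
Only the bzImage target needs to be built; modules are not needed.
The default menuconfig configuration is sufficient; the KVM- and VIRTIO-related options are already enabled.
The build finally produces the bzImage file:
Building busybox
Build a root filesystem based on busybox and set up its directory layout; see this blog post: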
https://blog.csdn.net/tugouxp/article/details/124434243
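Note that after following the steps in that post, the linuxrc file in the top-level directory must be renamed to init, because the kernel runs /init when it boots from an initramfs.
Then pack the rootfs directory into a cpio archive: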
$ find . | cpio -o --format=newc > root_fs.cpio
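The boot log further down ends with "Run /init as init process", so a missing or non-executable init is an easy way to end up with a guest that does not boot. The following is a minimal sketch of a pre-packing check, not part of the original workflow: the helper name `rootfs_init_problems` and the choice to check only the top-level `init` entry are assumptions made for illustration.

```python
import os
import stat


def rootfs_init_problems(rootfs_dir):
    """Return a list of human-readable problems with the top-level init entry.

    Hypothetical helper: it only checks what this post relies on, namely that
    the directory to be packed contains an executable entry named "init".
    """
    problems = []
    init_path = os.path.join(rootfs_dir, "init")
    linuxrc_path = os.path.join(rootfs_dir, "linuxrc")

    if not os.path.lexists(init_path):
        problems.append("missing top-level init")
        if os.path.lexists(linuxrc_path):
            problems.append("linuxrc present: rename it to init")
        return problems

    # os.stat follows symlinks, so a symlinked init is checked against
    # the file it points to.
    try:
        mode = os.stat(init_path).st_mode
    except OSError:
        problems.append("init is a dangling symlink")
        return problems

    if not mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
        problems.append("init is not executable")
    return problems
```

Calling `rootfs_init_problems("../busybox-1.32.0/_install")` before running the cpio command should return an empty list when the rename has been done.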
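The resulting directory structure is as follows:
With these three build steps done, the virtual machine is ready to run.
Running the virtual machine
Before running, confirm that the /dev/kvm device node exists on the host.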
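This check can also be scripted. The sketch below classifies the state of the device node; the function name `kvm_node_status` and its four result strings are assumptions made for this example. Since this post runs lkvm with sudo, the permission case mainly matters when running as an ordinary user.

```python
import os
import stat


def kvm_node_status(path="/dev/kvm"):
    """Classify the KVM device node.

    Returns one of: "missing", "not-a-char-device", "no-permission", "ok".
    """
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return "missing"
    # /dev/kvm is expected to be a character device.
    if not stat.S_ISCHR(st.st_mode):
        return "not-a-char-device"
    # The current user needs read and write access to open it.
    if not os.access(path, os.R_OK | os.W_OK):
        return "no-permission"
    return "ok"
```

A result of "missing" corresponds to the `ls: cannot access '/dev/kvm'` situation described in the notes at the end of this post.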
Start the virtual machine with the following command:
$ sudo ./lkvm run -k ../linux-5.15.18/arch/x86/boot/bzImage -i ../busybox-1.32.0/_install/root_fs.cpio
zlcao@zlcao-RedmiBook-14:~/kvm/kvmtool$ sudo ./lkvm run -k ../linux-5.15.18/arch/x86/boot/bzImage -i ../busybox-1.32.0/_install/root_fs.cpio
# lkvm run -k ../linux-5.15.18/arch/x86/boot/bzImage -m 704 -c 8 --name guest-100110
[ 0.000000] Linux version 5.15.18 (zlcao@zlcao-RedmiBook-14) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #1 SMP Fri Jan 27 12:27:51 CST 2023
[ 0.000000] Command line: noapic noacpi pci=conf1 reboot=k panic=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 earlyprintk=serial i8042.noaux=1 console=ttyS0 root=/dev/vda rw
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Hygon HygonGenuine
[ 0.000000] Centaur CentaurHauls
[ 0.000000] zhaoxin Shanghai
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[ 0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
[ 0.000000] x86/fpu: xstate_offset[3]: 832, xstate_sizes[3]: 64
[ 0.000000] x86/fpu: xstate_offset[4]: 896, xstate_sizes[4]: 64
[ 0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
[ 0.000000] signal: max sigframe size: 2032
[ 0.000000] BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000ffffe] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000002bffffff] usable
[ 0.000000] printk: bootconsole [earlyser0] enabled
[ 0.000000] ERROR: earlyprintk= earlyser already used
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] DMI not present or invalid.
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 11c01001, primary cpu clock
[ 0.000004] kvm-clock: using sched offset of 198180346 cycles
[ 0.000522] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[ 0.002007] tsc: Detected 1992.002 MHz processor
[ 0.002444] last_pfn = 0x2c000 max_arch_pfn = 0x400000000
[ 0.002986] Disabled
[ 0.003182] x86/PAT: MTRRs disabled, skipping PAT initialization too.
[ 0.003765] CPU MTRRs all blank - virtualized system.
[ 0.004236] x86/PAT: Configuration [0-7]: WB WT UC- UC WB WT UC- UC
Memory KASLR using RDRAND RDTSC...
[ 0.005590] found SMP MP-table at [mem 0x000f03b0-0x000f03bf]
[ 0.006456] Using GB pages for direct mapping
[ 0.007160] RAMDISK: [mem 0x2bd00000-0x2bf83fff]
[ 0.007640] ACPI: Early table checksum verification disabled
[ 0.008311] ACPI BIOS Error (bug): A valid RSDP was not found (20210730/tbxfroot-210)
[ 0.009234] No NUMA configuration found
[ 0.009526] Faking a node at [mem 0x0000000000000000-0x000000002bffffff]
[ 0.010001] NODE_DATA(0) allocated [mem 0x2bfd6000-0x2bffffff]
[ 0.010937] Zone ranges:
[ 0.011122] DMA [mem 0x0000000000001000-0x0000000000ffffff]
[ 0.011581] DMA32 [mem 0x0000000001000000-0x000000002bffffff]
[ 0.012074] Normal empty
[ 0.012351] Device empty
[ 0.012626] Movable zone start for each node
[ 0.012971] Early memory node ranges
[ 0.013292] node 0: [mem 0x0000000000001000-0x000000000009efff]
[ 0.013732] node 0: [mem 0x0000000000100000-0x000000002bffffff]
[ 0.014192] Initmem setup node 0 [mem 0x0000000000001000-0x000000002bffffff]
[ 0.014710] On node 0, zone DMA: 1 pages in unavailable ranges
[ 0.014878] On node 0, zone DMA: 97 pages in unavailable ranges
[ 0.022910] On node 0, zone DMA32: 16384 pages in unavailable ranges
[ 0.023633] Intel MultiProcessor Specification v1.4
[ 0.024453] MPTABLE: OEM ID: KVMCPU00
[ 0.024719] MPTABLE: Product ID: 0.1
[ 0.025000] MPTABLE: APIC at: 0xFEE00000
[ 0.025279] Processor #0 (Bootup-CPU)
[ 0.025527] Processor #1
[ 0.025698] Processor #2
[ 0.025861] Processor #3
[ 0.026025] Processor #4
[ 0.026191] Processor #5
[ 0.026356] Processor #6
[ 0.026521] Processor #7
[ 0.026715] IOAPIC[0]: apic_id 9, version 17, address 0xfec00000, GSI 0-23
[ 0.027163] Processors: 8
[ 0.027344] smpboot: Allowing 8 CPUs, 0 hotplug CPUs
[ 0.027735] kvm-guest: KVM setup pv remote TLB flush
[ 0.028059] kvm-guest: setup PV sched yield
[ 0.028372] PM: hibernation: Registered nosave memory: [mem 0x00000000-0x00000fff]
[ 0.028859] PM: hibernation: Registered nosave memory: [mem 0x0009f000-0x0009ffff]
[ 0.029349] PM: hibernation: Registered nosave memory: [mem 0x000a0000-0x000effff]
[ 0.029843] PM: hibernation: Registered nosave memory: [mem 0x000f0000-0x000fefff]
[ 0.030330] PM: hibernation: Registered nosave memory: [mem 0x000ff000-0x000fffff]
[ 0.030820] [mem 0x2c000000-0xffffffff] available for PCI devices
[ 0.031217] Booting paravirtualized kernel on KVM
[ 0.031546] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns
[ 0.032234] setup_percpu: NR_CPUS:8192 nr_cpumask_bits:8 nr_cpu_ids:8 nr_node_ids:1
[ 0.034042] percpu: Embedded 61 pages/cpu s212992 r8192 d28672 u262144
[ 0.034524] kvm-guest: setup async PF for cpu 0
[ 0.034866] kvm-guest: stealtime: cpu 0, msr 2ae33080
[ 0.035203] kvm-guest: PV spinlocks enabled
[ 0.035483] PV qspinlock hash table entries: 256 (order: 0, 4096 bytes, linear)
[ 0.035994] Built 1 zonelists, mobility grouping on. Total pages: 177152
[ 0.036454] Policy zone: DMA32
[ 0.036658] Kernel command line: noapic noacpi pci=conf1 reboot=k panic=1 i8042.direct=1 i8042.dumbkbd=1 i8042.nopnp=1 earlyprintk=serial i8042.noaux=1 console=ttyS0 root=/dev/vda rw
[ 0.037994] Unknown kernel command line parameters "noacpi", will be passed to user space.
[ 0.039146] Dentry cache hash table entries: 131072 (order: 8, 1048576 bytes, linear)
[ 0.039968] Inode-cache hash table entries: 65536 (order: 7, 524288 bytes, linear)
[ 0.040621] mem auto-init: stack:off, heap alloc:on, heap free:off
[ 0.045493] Memory: 657968K/720504K available (16393K kernel code, 4387K rwdata, 10492K rodata, 2932K init, 4816K bss, 62276K reserved, 0K cma-reserved)
[ 0.046448] random: get_random_u64 called from __kmem_cache_create+0x2f/0x520 with crng_init=0
[ 0.046702] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[ 0.047702] ftrace: allocating 47928 entries in 188 pages
[ 0.064484] ftrace: allocated 188 pages with 5 groups
[ 0.065149] rcu: Hierarchical RCU implementation.
[ 0.065448] rcu: RCU restricting CPUs from NR_CPUS=8192 to nr_cpu_ids=8.
[ 0.065873] Rude variant of Tasks RCU enabled.
[ 0.066157] Tracing variant of Tasks RCU enabled.
[ 0.066456] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[ 0.066930] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
[ 0.070850] NR_IRQS: 524544, nr_irqs: 488, preallocated irqs: 16
[ 0.071549] random: crng done (trusting CPU's manufacturer)
[ 0.071979] Console: colour *CGA 80x25
[ 0.072283] printk: console [ttyS0] enabled
[ 0.072283] printk: console [ttyS0] enabled
[ 0.072969] printk: bootconsole [earlyser0] disabled
[ 0.072969] printk: bootconsole [earlyser0] disabled
[ 0.073921] APIC: Switch to symmetric I/O mode setup
[ 0.074351] Not enabling interrupt remapping due to skipped IO-APIC setup
[ 0.075319] kvm-guest: setup PV IPIs
[ 0.075970] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x396d566cf43, max_idle_ns: 881590760263 ns
[ 0.076947] Calibrating delay loop (skipped) preset value.. 3984.00 BogoMIPS (lpj=7968008)
[ 0.077665] pid_max: default: 32768 minimum: 301
[ 0.081003] LSM: Security Framework initializing
[ 0.081417] landlock: Up and running.
[ 0.081733] Yama: becoming mindful.
[ 0.082087] AppArmor: AppArmor initialized
[ 0.082481] Mount-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
[ 0.083131] Mountpoint-cache hash table entries: 2048 (order: 2, 16384 bytes, linear)
Poking KASLR using RDRAND RDTSC...
[ 0.085044] x86/cpu: User Mode Instruction Prevention (UMIP) activated
[ 0.085971] Last level iTLB entries: 4KB 64, 2MB 8, 4MB 8
[ 0.086434] Last level dTLB entries: 4KB 64, 2MB 0, 4MB 0, 1GB 4
[ 0.086991] Spectre V1 : Mitigation: usercopy/swapgs barriers and __user pointer sanitization
[ 0.088961] Spectre V2 : Mitigation: Full generic retpoline
[ 0.089424] Spectre V2 : Spectre v2 / SpectreRSB mitigation: Filling RSB on context switch
[ 0.090121] Spectre V2 : Enabling Restricted Speculation for firmware calls
[ 0.090713] Spectre V2 : mitigation: Enabling conditional Indirect Branch Prediction Barrier
[ 0.091429] Spectre V2 : User space: Mitigation: STIBP via seccomp and prctl
[ 0.092018] Speculative Store Bypass: Mitigation: Speculative Store Bypass disabled via prctl and seccomp
[ 0.092950] SRBDS: Unknown: Dependent on hypervisor status
[ 0.093435] MDS: Mitigation: Clear CPU buffers
[ 0.101335] Freeing SMP alternatives memory: 40K
[ 0.318017] smpboot: CPU0: Intel 06/8e (family: 0x6, model: 0x8e, stepping: 0xb)
[ 0.319105] Performance Events: Skylake events, 32-deep LBR, full-width counters, Intel PMU driver.
[ 0.321782] ... version: 2
[ 0.322127] ... bit width: 48
[ 0.322481] ... generic registers: 4
[ 0.322819] ... value mask: 0000ffffffffffff
[ 0.323267] ... max period: 00007fffffffffff
[ 0.323719] ... fixed-purpose events: 3
[ 0.324941] ... event mask: 000000070000000f
[ 0.325598] rcu: Hierarchical SRCU implementation.
[ 0.327121] smp: Bringing up secondary CPUs ...
[ 0.327742] x86: Booting SMP configuration:
[ 0.328094] .... node #0, CPUs: #1
[ 0.009568] kvm-clock: cpu 1, msr 11c01041, secondary cpu clock
[ 0.329211] kvm-guest: setup async PF for cpu 1
[ 0.329667] kvm-guest: stealtime: cpu 1, msr 2ae73080
[ 0.330021] #2
[ 0.009568] kvm-clock: cpu 2, msr 11c01081, secondary cpu clock
[ 0.009568] [Firmware Bug]: CPU2: APIC id mismatch. Firmware: 2 APIC: 7
[ 0.331227] kvm-guest: setup async PF for cpu 2
[ 0.331227] kvm-guest: stealtime: cpu 2, msr 2aeb3080
[ 0.333172] #3
[ 0.009568] kvm-clock: cpu 3, msr 11c010c1, secondary cpu clock
[ 0.009568] [Firmware Bug]: CPU3: APIC id mismatch. Firmware: 3 APIC: 7
[ 0.334905] kvm-guest: setup async PF for cpu 3
[ 0.334905] kvm-guest: stealtime: cpu 3, msr 2aef3080
[ 0.334905] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[ 0.337190] #4
[ 0.009568] kvm-clock: cpu 4, msr 11c01101, secondary cpu clock
[ 0.009568] [Firmware Bug]: CPU4: APIC id mismatch. Firmware: 4 APIC: 1
[ 0.339458] kvm-guest: setup async PF for cpu 4
[ 0.339458] kvm-guest: stealtime: cpu 4, msr 2af33080
[ 0.341165] #5
[ 0.009568] kvm-clock: cpu 5, msr 11c01141, secondary cpu clock
[ 0.009568] [Firmware Bug]: CPU5: APIC id mismatch. Firmware: 5 APIC: 0
[ 0.343159] kvm-guest: setup async PF for cpu 5
[ 0.343159] kvm-guest: stealtime: cpu 5, msr 2af73080
[ 0.345078] #6
[ 0.009568] kvm-clock: cpu 6, msr 11c01181, secondary cpu clock
[ 0.009568] [Firmware Bug]: CPU6: APIC id mismatch. Firmware: 6 APIC: 7
[ 0.346579] kvm-guest: setup async PF for cpu 6
[ 0.346579] kvm-guest: stealtime: cpu 6, msr 2afb3080
[ 0.346579] #7
[ 0.009568] kvm-clock: cpu 7, msr 11c011c1, secondary cpu clock
[ 0.009568] [Firmware Bug]: CPU7: APIC id mismatch. Firmware: 7 APIC: 6
[ 0.349375] kvm-guest: setup async PF for cpu 7
[ 0.349375] kvm-guest: stealtime: cpu 7, msr 2aff3080
[ 0.349687] smp: Brought up 1 node, 8 CPUs
[ 0.349687] smpboot: Max logical packages: 1
[ 0.349897] smpboot: Total of 8 processors activated (31872.03 BogoMIPS)
[ 0.353085] devtmpfs: initialized
[ 0.353355] x86/mm: Memory block size: 128MB
[ 0.354192] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[ 0.354192] futex hash table entries: 2048 (order: 5, 131072 bytes, linear)
[ 0.354845] pinctrl core: initialized pinctrl subsystem
[ 0.357228] PM: RTC time: 05:49:54, date: 2023-01-27
[ 0.358851] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[ 0.359701] DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
[ 0.361013] DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
[ 0.361921] DMA: preallocated 128 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
[ 0.362746] audit: initializing netlink subsys (disabled)
[ 0.363325] audit: type=2000 audit(1674798594.637:1): state=initialized audit_enabled=0 res=1
[ 0.363325] thermal_sys: Registered thermal governor 'fair_share'
[ 0.363325] thermal_sys: Registered thermal governor 'bang_bang'
[ 0.363325] thermal_sys: Registered thermal governor 'step_wise'
[ 0.364971] thermal_sys: Registered thermal governor 'user_space'
[ 0.365682] thermal_sys: Registered thermal governor 'power_allocator'
[ 0.366337] EISA bus registered
[ 0.367271] cpuidle: using governor ladder
[ 0.367616] cpuidle: using governor menu
[ 0.369047] PCI: Using configuration type 1 for base access
[ 0.371011] Kprobes globally optimized
[ 0.371378] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[ 0.371378] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[ 0.373053] ACPI: Interpreter disabled.
[ 0.373350] iommu: Default domain type: Translated
[ 0.373350] iommu: DMA domain TLB invalidation policy: lazy mode
[ 0.376980] vgaarb: loaded
[ 0.377344] SCSI subsystem initialized
[ 0.377600] usbcore: registered new interface driver usbfs
[ 0.377600] usbcore: registered new interface driver hub
[ 0.377741] usbcore: registered new device driver usb
[ 0.378091] pps_core: LinuxPPS API ver. 1 registered
[ 0.378415] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[ 0.379028] PTP clock support registered
[ 0.379316] EDAC MC: Ver: 3.0.0
[ 0.381070] NetLabel: Initializing
[ 0.381298] NetLabel: domain hash size = 128
[ 0.381582] NetLabel: protocols = UNLABELED CIPSOv4 CALIPSO
[ 0.381967] NetLabel: unlabeled traffic allowed by default
[ 0.382356] PCI: Probing PCI hardware
[ 0.382356] PCI host bridge to bus 0000:00
[ 0.382356] pci_bus 0000:00: root bus resource [io 0x0000-0xffff]
[ 0.382356] pci_bus 0000:00: root bus resource [mem 0x00000000-0x7fffffffff]
[ 0.382388] pci_bus 0000:00: No busn resource found for root bus, will use [bus 00-ff]
[ 0.383074] pci 0000:00:00.0: [1af4:1041] type 00 class 0x020000
[ 0.384986] pci 0000:00:00.0: reg 0x10: [io 0x6200-0x62ff]
[ 0.385384] pci 0000:00:00.0: reg 0x14: [mem 0xd2000000-0xd20000ff]
[ 0.385820] pci 0000:00:00.0: reg 0x18: [mem 0xd2000400-0xd20007ff]
[ 0.394166] pci_bus 0000:00: busn_res: [bus 00-ff] end is updated to 00
[ 0.394690] clocksource: Switched to clocksource kvm-clock
[ 0.407575] VFS: Disk quotas dquot_6.6.0
[ 0.407909] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[ 0.408491] AppArmor: AppArmor Filesystem Enabled
[ 0.408831] pnp: PnP ACPI: disabled
[ 0.410916] NET: Registered PF_INET protocol family
[ 0.411442] IP idents hash table entries: 16384 (order: 5, 131072 bytes, linear)
[ 0.412701] tcp_listen_portaddr_hash hash table entries: 512 (order: 1, 8192 bytes, linear)
[ 0.413465] TCP established hash table entries: 8192 (order: 4, 65536 bytes, linear)
[ 0.414181] TCP bind hash table entries: 8192 (order: 5, 131072 bytes, linear)
[ 0.414825] TCP: Hash tables configured (established 8192 bind 8192)
[ 0.415558] MPTCP token hash table entries: 1024 (order: 2, 24576 bytes, linear)
[ 0.416173] UDP hash table entries: 512 (order: 2, 16384 bytes, linear)
[ 0.416776] UDP-Lite hash table entries: 512 (order: 2, 16384 bytes, linear)
[ 0.417414] NET: Registered PF_UNIX/PF_LOCAL protocol family
[ 0.417903] NET: Registered PF_XDP protocol family
[ 0.418322] pci_bus 0000:00: resource 4 [io 0x0000-0xffff]
[ 0.418794] pci_bus 0000:00: resource 5 [mem 0x00000000-0x7fffffffff]
[ 0.419406] PCI: CLS 0 bytes, default 64
[ 0.419810] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x396d566cf43, max_idle_ns: 881590760263 ns
[ 0.419933] Trying to unpack rootfs image as initramfs...
[ 0.421289] clocksource: Switched to clocksource tsc
[ 0.421757] platform rtc_cmos: registered platform RTC device (no PNP device found)
[ 0.423248] Initialise system trusted keyrings
[ 0.423671] Key type blacklist registered
[ 0.424313] workingset: timestamp_bits=36 max_order=18 bucket_order=0
[ 0.426758] zbud: loaded
[ 0.427453] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[ 0.428192] fuse: init (API version 7.34)
[ 0.428890] integrity: Platform Keyring initialized
[ 0.430492] Freeing initrd memory: 2576K
[ 0.435013] Key type asymmetric registered
[ 0.435289] Asymmetric key parser 'x509' registered
[ 0.435621] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 243)
[ 0.436190] io scheduler mq-deadline registered
[ 0.436884] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[ 0.438064] Serial: 8250/16550 driver, 32 ports, IRQ sharing enabled
[ 0.459372] serial8250: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a U6_16550A
[ 0.480907] serial8250: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a U6_16550A
[ 0.502620] serial8250: ttyS2 at I/O 0x3e8 (irq = 4, base_baud = 115200) is a U6_16550A
[ 0.505001] Linux agpgart interface v0.103
[ 0.508374] loop: module loaded
[ 0.509013] tun: Universal TUN/TAP device driver, 1.6
[ 0.509497] PPP generic driver version 2.4.2
[ 0.509993] VFIO - User Level meta-driver version: 0.3
[ 0.510593] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[ 0.511181] ehci-pci: EHCI PCI platform driver
[ 0.511689] ehci-platform: EHCI generic platform driver
[ 0.512245] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[ 0.512857] ohci-pci: OHCI PCI platform driver
[ 0.513377] ohci-platform: OHCI generic platform driver
[ 0.513947] uhci_hcd: USB Universal Host Controller Interface driver
[ 0.514683] i8042: PNP detection disabled
[ 0.515301] serio: i8042 KBD port at 0x60,0x64 irq 1
[ 0.516030] mousedev: PS/2 mouse device common for all mice
[ 0.516659] input: AT Raw Set 2 keyboard as /devices/platform/i8042/serio0/input/input0
[ 0.517669] rtc_cmos rtc_cmos: only 24-hr supported
[ 0.518179] i2c_dev: i2c /dev entries driver
[ 0.518713] device-mapper: core: CONFIG_IMA_DISABLE_HTABLE is disabled. Duplicate IMA measurements will not be recorded in the IMA log.
[ 0.520248] device-mapper: uevent: version 1.0.3
[ 0.521055] device-mapper: ioctl: 4.45.0-ioctl (2021-03-22) initialised: dm-devel@redhat.com
[ 0.522019] platform eisa.0: Probing EISA bus 0
[ 0.522587] eisa 00:00: EISA: Mainboard @@@0000 detected
[ 0.523175] eisa 00:01: EISA: slot 1: @@@0000 detected (disabled)
[ 0.523834] eisa 00:02: EISA: slot 2: @@@0000 detected (disabled)
[ 0.524514] eisa 00:03: EISA: slot 3: @@@0000 detected (disabled)
[ 0.525209] eisa 00:04: EISA: slot 4: @@@0000 detected (disabled)
[ 0.525882] eisa 00:05: EISA: slot 5: @@@0000 detected (disabled)
[ 0.526553] eisa 00:06: EISA: slot 6: @@@0000 detected (disabled)
[ 0.527228] eisa 00:07: EISA: slot 7: @@@0000 detected (disabled)
[ 0.527902] eisa 00:08: EISA: slot 8: @@@0000 detected (disabled)
[ 0.528538] platform eisa.0: EISA: Detected 8 cards
[ 0.529059] intel_pstate: CPU model not supported
[ 0.529779] ledtrig-cpu: registered to indicate activity on CPUs
[ 0.530507] intel_pmc_core intel_pmc_core.0: initialized
[ 0.531105] drop_monitor: Initializing network drop monitor service
[ 0.531923] NET: Registered PF_INET6 protocol family
[ 0.535712] Segment Routing with IPv6
[ 0.536127] In-situ OAM (IOAM) with IPv6
[ 0.536550] NET: Registered PF_PACKET protocol family
[ 0.537127] Key type dns_resolver registered
[ 0.538631] IPI shorthand broadcast: enabled
[ 0.539012] sched_clock: Marking stable (532988952, 5568579)->(558785633, -20228102)
[ 0.540219] registered taskstats version 1
[ 0.540813] Loading compiled-in X.509 certificates
[ 0.542057] Loaded X.509 cert 'Build time autogenerated kernel key: 25cc8cb7907826729975261abe82eb726e9a7e0c'
[ 0.544528] zswap: loaded using pool lzo/zbud
[ 0.545873] Key type ._fscrypt registered
[ 0.546290] Key type .fscrypt registered
[ 0.546703] Key type fscrypt-provisioning registered
[ 0.549064] Key type encrypted registered
[ 0.549676] AppArmor: AppArmor sha1 policy hashing enabled
[ 0.550410] ima: No TPM chip found, activating TPM-bypass!
[ 0.550989] Loading compiled-in module X.509 certificates
[ 0.551999] Loaded X.509 cert 'Build time autogenerated kernel key: 25cc8cb7907826729975261abe82eb726e9a7e0c'
[ 0.552950] ima: Allocated hash algorithm: sha1
[ 0.553710] ima: No architecture policies found
[ 0.554232] evm: Initialising EVM extended attributes:
[ 0.554765] evm: security.selinux
[ 0.555142] evm: security.SMACK64
[ 0.555494] evm: security.SMACK64EXEC
[ 0.555881] evm: security.SMACK64TRANSMUTE
[ 0.556308] evm: security.SMACK64MMAP
[ 0.556692] evm: security.apparmor
[ 0.557060] evm: security.ima
[ 0.557367] evm: security.capability
[ 0.557750] evm: HMAC attrs: 0x1
[ 0.558401] PM: Magic number: 7:314:821
[ 0.558969] RAS: Correctable Errors collector initialized.
[ 0.561277] Freeing unused decrypted memory: 2036K
[ 0.562714] Freeing unused kernel image (initmem) memory: 2932K
[ 0.589485] Write protecting the kernel read-only data: 30720k
[ 0.597456] Freeing unused kernel image (text/rodata gap) memory: 2036K
[ 0.604078] Freeing unused kernel image (rodata/data gap) memory: 1796K
[ 0.659967] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[ 0.660538] Run /init as init process
Please press Enter to activate this console.
/ #
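Running top inside the virtual machine:
Multi-core virtualization
The test platform has 8 cores, and by default the code sets the number of VCPUs to the actual number of host cores, so the figure above shows 8 active CPUs.
The code shows that each VCPU corresponds to one thread in the host process. The VCPU count can be chosen freely with the --cpus option, including values larger than the number of host cores; the example below requests 32 on this 8-core host: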
$ sudo ./lkvm run -k ../linux-5.15.18/arch/x86/boot/bzImage -i ../busybox-1.32.0/_install/root_fs.cpio --cpus=32 --name zilong
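Code analysis
The argument that follows lkvm selects the second-level entry function to execute. For example, the virtual machine is started with the lkvm run command, whose entry function is kvm_cmd_run.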
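The dispatch idea can be modeled in a few lines. This is a toy Python model of subcommand dispatch, not kvmtool's C code: the table name `COMMANDS`, the function `handle_command`, and the second table entry are assumptions for illustration, and the handlers only echo their arguments.

```python
def kvm_cmd_run(args):
    # Stand-in for the real entry function; here it only echoes its arguments.
    return ("run", list(args))


def kvm_cmd_list(args):
    # Second entry, included only so the table has more than one row.
    return ("list", list(args))


# Subcommand name -> entry function.
COMMANDS = {
    "run": kvm_cmd_run,
    "list": kvm_cmd_list,
}


def handle_command(argv):
    """Dispatch argv[0] to its entry function, passing on the remaining arguments."""
    if not argv:
        raise SystemExit("usage: lkvm COMMAND [ARGS]")
    name, rest = argv[0], argv[1:]
    try:
        handler = COMMANDS[name]
    except KeyError:
        raise SystemExit(f"unknown command: {name}")
    return handler(rest)
```

With this model, `handle_command(["run", "-k", "bzImage"])` reaches `kvm_cmd_run` with `["-k", "bzImage"]`, which mirrors how `lkvm run ...` reaches its entry function.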
kvm_cmd_run calls kvm_cmd_run_work to continue launching the virtual machine, creating one pthread per VCPU to run the guest OS.
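The one-thread-per-VCPU structure can be sketched as follows. This is only a structural model under stated assumptions: `launch_vcpus` and `guest_loop` are hypothetical names, `guest_loop` stands in for the per-VCPU run loop (which in a real KVM tool repeatedly issues the KVM_RUN ioctl), and Python threads are used in place of pthreads.

```python
import threading


def launch_vcpus(nr_cpus, guest_loop):
    """Start one host thread per VCPU, wait for all of them,
    and return each thread's result keyed by cpu id."""
    results = {}
    lock = threading.Lock()

    def vcpu_thread(cpu_id):
        value = guest_loop(cpu_id)
        with lock:
            results[cpu_id] = value

    threads = [
        threading.Thread(target=vcpu_thread, args=(cpu_id,), name=f"vcpu-{cpu_id}")
        for cpu_id in range(nr_cpus)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Nothing in this structure ties the thread count to the number of physical cores, which is why a --cpus value above the host core count is possible: the host scheduler simply time-slices the extra threads.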
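Setting CPUID
While the guest is running, executing the cpuid instruction to query CPU identification causes a VM exit into the host, where the instruction is emulated:
Because each VCPU is bound to one thread of the host VM process, each VCPU thread must, during initialization, write the CPUID data for the VCPU it represents into the host KVM driver. After the guest OS exits from non-root mode to root mode, the host KVM driver uses this data to emulate CPUID. That is why each VCPU thread performs a CPUID setup step next.
The call chain is: kvm_cpu_thread->kvm_cpu__start->kvm_cpu__reset_vcpu->kvm_cpu__setup_cpuid.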
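To make the per-VCPU setup more concrete, here is a toy model of preparing a CPUID table for one VCPU. In a real KVM tool the host-supported table comes from the KVM_GET_SUPPORTED_CPUID ioctl and is installed per VCPU with KVM_SET_CPUID2. Everything else here is an assumption for illustration: the function name `setup_cpuid`, the dict layout, and the two leaf 1 adjustments shown (the bit positions are architectural, but whether kvmtool at the commit used in this post applies exactly these adjustments has not been verified).

```python
HYPERVISOR_BIT = 1 << 31   # CPUID leaf 1, ECX bit 31: hypervisor present
APIC_ID_SHIFT = 24         # CPUID leaf 1, EBX bits 31:24: initial APIC ID


def setup_cpuid(host_entries, cpu_id):
    """Return a per-VCPU copy of the CPUID table with leaf 1 adjusted.

    host_entries: list of dicts with keys "function", "eax", "ebx", "ecx", "edx".
    """
    guest_entries = []
    for entry in host_entries:
        entry = dict(entry)  # copy, so the caller's table is never modified
        if entry["function"] == 1:
            # Replace the APIC ID field with this VCPU's id.
            cleared = entry["ebx"] & ~(0xFF << APIC_ID_SHIFT) & 0xFFFFFFFF
            entry["ebx"] = cleared | ((cpu_id & 0xFF) << APIC_ID_SHIFT)
            # Advertise that the guest runs under a hypervisor.
            entry["ecx"] = (entry["ecx"] | HYPERVISOR_BIT) & 0xFFFFFFFF
        guest_entries.append(entry)
    return guest_entries
```

Each VCPU thread would call this with its own `cpu_id`, so every VCPU gets its own table even though all of them start from the same host data.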
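I/O virtualization
With KVM, a guest access to an emulated I/O region traps: kvmtool can register a region as an IOTRAP, and when the guest OS accesses that region, it exits non-root mode and enters the host. I/O virtualization is built on this mechanism. The core function is:
When a trap occurs, the guest OS exits to the host:
The call chain is: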
kvm_cpu__emulate_io->kvm__emulate_io->mmio->mmio_fn(vcpu, port, data, size, is_write, mmio->ptr);
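Finally, the mmio_fn callback registered through kvm_register_iotrap runs and implements the device-specific I/O handling. This is somewhat like QEMU TCG, except that TCG implements the trap in software by inserting helper calls during translation, whereas KVM relies on hardware-supported traps; the callback flow is very similar.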
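The register-then-dispatch flow described above can be modeled as a small table of address ranges and callbacks. This is a simplified sketch, not kvmtool's implementation: the class `IoTrapTable`, its method names, the linear list, and the example device `serial_io` are assumptions for illustration. Only the callback argument order (vcpu, address, data, size, is_write, ptr) follows the mmio_fn call quoted above.

```python
class IoTrapTable:
    """Toy registry mapping address ranges to device callbacks."""

    def __init__(self):
        self._traps = []  # list of (start, length, callback, ptr)

    def register_iotrap(self, start, length, callback, ptr=None):
        end = start + length
        for s, n, _, _ in self._traps:
            if start < s + n and s < end:
                raise ValueError(
                    f"range {start:#x}-{end - 1:#x} overlaps an existing trap"
                )
        self._traps.append((start, length, callback, ptr))

    def emulate_io(self, vcpu, addr, data, is_write):
        """Dispatch one trapped access; return False if no device claims it."""
        for start, length, callback, ptr in self._traps:
            if start <= addr < start + length:
                callback(vcpu, addr, data, len(data), is_write, ptr)
                return True
        return False


def serial_io(vcpu, addr, data, size, is_write, ptr):
    # Hypothetical device: ptr is a bytearray that collects written bytes;
    # reads return zeros by overwriting the caller's buffer in place.
    if is_write:
        ptr.extend(data[:size])
    else:
        data[:size] = b"\x00" * size
```

For example, after `table.register_iotrap(0x3F8, 8, serial_io, out)` (0x3f8 is the ttyS0 port shown in the boot log), a guest write trapped at 0x3F8 would be delivered with `table.emulate_io(vcpu, 0x3F8, bytearray(b"A"), True)` and end up in `out`.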
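This completes the test. Later posts will dissect the kvmtool code step by step to deepen the understanding of how virtualization is implemented.
Notes
The host system must support hardware-assisted CPU virtualization: VT-x on Intel processors and AMD-V on AMD processors. Without it, running lkvm reports the error shown below. Only after running lscpu did I realize that the platform was itself a VMware virtual machine.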
~/Workspace$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 4
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 6242 CPU @ 2.80GHz
Stepping: 7
CPU MHz: 2793.437
BogoMIPS: 5586.87
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 22528K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
~/Workspace$
~/Workspace$ ls -l /dev/kvm
ls: cannot access '/dev/kvm': No such file or directory
~/Workspace$
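On Linux, VT-x and AMD-V show up as the `vmx` and `svm` CPU flags, so the check can be automated. The sketch below is a minimal example; the function names `hw_virt_flags` and `host_hw_virt_flags` are assumptions made for this post.

```python
def hw_virt_flags(flags_line):
    """Return the hardware-virtualization flags ("vmx" for Intel VT-x,
    "svm" for AMD-V) found in a space-separated CPU flags string."""
    return sorted({"vmx", "svm"} & set(flags_line.split()))


def host_hw_virt_flags(cpuinfo_path="/proc/cpuinfo"):
    """Collect vmx/svm from every "flags" line of /proc/cpuinfo (Linux only)."""
    found = set()
    with open(cpuinfo_path) as f:
        for line in f:
            key, _, value = line.partition(":")
            if key.strip() == "flags":
                found.update(hw_virt_flags(value))
    return sorted(found)
```

Applied to the Flags line in the lscpu output above, `hw_virt_flags` returns an empty list: the line contains `hypervisor` but neither `vmx` nor `svm`, which matches the missing /dev/kvm node.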
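The guest OS architecture must match the host OS architecture. Although I knew beforehand that KVM only supports guests with the same ISA as the virtualization-capable host CPU, I overlooked this at the start of the experiment and tried to boot the ARM zImage and filesystem from the blog post above. The run simply hung, and it only made sense once I thought about how QEMU handles this.
That attempt also revealed a detail: the bzImage file exists only on x86. The ARM build also accepts the make bzImage command, but what it actually produces is a zImage; there is no bzImage.
Summary
The KVM hypervisor is a Type 2 hypervisor, so QEMU-KVM and kvmtool, both built on KVM, are Type 2 implementations as well. kvmtool is very similar to QEMU; the overall architecture is shown in the figure below:
Because kvmtool is a lightweight KVM virtual machine implementation, its code is worth studying to see how it boots a kernel from scratch and to understand virtualization principles in depth. That will be very helpful later when studying other modules such as virtio and I/O virtualization.
References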
https://blog.csdn.net/Linux_Everything/article/details/117538064
https://zhuanlan.zhihu.com/p/545241171
https://zhuanlan.zhihu.com/p/583203148
https://blog.csdn.net/qq_41146650/article/details/124595502