Large scratch regions on KVM with hw-interrupts fail with EEXIST due to APIC access page overlap #1389

@ppenna

Summary

See nanvix/nanvix#2082 for more context on why this is relevant for Nanvix on Hyperlight.

When a guest configures a large scratch region via SandboxConfiguration::set_scratch_size(), HyperlightVm::new() fails on KVM with EEXIST (Error(17)):

UpdateRegion(MapMemory(Hypervisor(KvmError(Error(17)))))

The root cause is that the scratch memory slot (KVM slot 1) overlaps with an internal KVM memory slot created by create_irq_chip() for the LAPIC/APIC access page at GPA 0xFEE00000.

Details

Scratch region placement

Hyperlight places the scratch region at the top of the 32-bit GPA space via scratch_base_gpa():

// hyperlight_common::layout
pub fn scratch_base_gpa(size: usize) -> u64 {
    (MAX_GPA - size + 1) as u64
}

With MAX_GPA = 0xFFFF_FFFF, a scratch size of e.g. 0x6882000 (~104 MB) yields:

  • scratch_base = 0xF977E000
  • Scratch KVM slot covers GPA range [0xF977E000, 0xFFFFFFFF]
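The arithmetic above can be checked with a minimal sketch. It reuses the formula quoted from `hyperlight_common::layout`; the `scratch_gpa_range` helper is hypothetical and only added here for illustration.

```rust
/// Top of the 32-bit guest physical address space (value taken from the text above).
const MAX_GPA: usize = 0xFFFF_FFFF;

/// Same formula as the `scratch_base_gpa()` snippet quoted above.
fn scratch_base_gpa(size: usize) -> u64 {
    (MAX_GPA - size + 1) as u64
}

/// Hypothetical helper: inclusive GPA range covered by a scratch region of `size` bytes.
fn scratch_gpa_range(size: usize) -> (u64, u64) {
    (scratch_base_gpa(size), MAX_GPA as u64)
}
```

Under these assumptions, `scratch_gpa_range(0x6882000)` evaluates to `(0xF977E000, 0xFFFFFFFF)`, matching the range listed above.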

KVM irqchip APIC access page

When the hw-interrupts feature is enabled, KvmVm::new() calls create_irq_chip(). On Intel hardware with APICv (or on AMD with AVIC), KVM automatically creates an internal APIC access page at GPA 0xFEE00000. This is a non-removable, non-movable memory slot managed internally by KVM.

Since 0xFEE00000 falls inside [0xF977E000, 0xFFFFFFFF], KVM rejects the set_user_memory_region call for the scratch slot with EEXIST — the two regions overlap.
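The overlap condition can be expressed as a small predicate. This is only a sketch of the check described above, not code from Hyperlight or KVM; the function name and the `APIC_BASE_GPA` constant name are hypothetical, and `scratch_size` is assumed to be at most 4 GiB.

```rust
/// Top of the 32-bit guest physical address space (value from the issue text).
const MAX_GPA: u64 = 0xFFFF_FFFF;
/// GPA of the APIC access page, as stated in the issue text.
const APIC_BASE_GPA: u64 = 0xFEE0_0000;

/// Hypothetical check: returns true when the scratch range
/// [scratch_base, MAX_GPA] contains the APIC base address.
/// Assumes `scratch_size <= MAX_GPA + 1`.
fn scratch_contains_apic_base(scratch_size: u64) -> bool {
    let scratch_base = MAX_GPA - scratch_size + 1;
    scratch_base <= APIC_BASE_GPA && APIC_BASE_GPA <= MAX_GPA
}
```

With the sizes from this report, the predicate is true for `0x6882000` and false for the default `0x48000`.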

Maximum safe scratch size

The maximum scratch size that avoids the APIC page is:

max_scratch = MAX_GPA - 0xFEE00000 = 0x11FFFFF ≈ 18 MB

Any set_scratch_size() value above ~18 MB will fail on KVM with hw-interrupts enabled on Intel (APICv) or AMD (AVIC) hosts.
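As a rough illustration of this bound, the sketch below computes it and rejects sizes that exceed it. The names are hypothetical, and the 4 KiB page size used for rounding is an assumption rather than something stated in this report.

```rust
const MAX_GPA: u64 = 0xFFFF_FFFF;
const APIC_BASE_GPA: u64 = 0xFEE0_0000;
/// Assumption: 4 KiB pages.
const PAGE_SIZE: u64 = 0x1000;

/// Upper bound from the formula above.
const MAX_SCRATCH: u64 = MAX_GPA - APIC_BASE_GPA;

/// Largest page-aligned scratch size that stays within the bound.
fn max_page_aligned_scratch() -> u64 {
    MAX_SCRATCH & !(PAGE_SIZE - 1)
}

/// Hypothetical validation: reject sizes above the bound with a readable message.
fn check_scratch_size(size: u64) -> Result<(), String> {
    if size > MAX_SCRATCH {
        Err(format!(
            "scratch size {:#x} exceeds {:#x}: region would reach GPA {:#x}",
            size, MAX_SCRATCH, APIC_BASE_GPA
        ))
    } else {
        Ok(())
    }
}
```

Since scratch sizes are normally page-aligned, the largest usable value under these assumptions is `0x11FF000`, which places the scratch base at `0xFEE01000`.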

Why this does not affect Windows WHP

On Windows, WhpVm::new() does not create an explicit interrupt controller memory slot at a fixed GPA. The WHP API (WHvMapGpaRange2) maps guest physical address ranges independently, and the platform's LAPIC emulation does not reserve a GPA slot that conflicts with user-mapped regions. This is why the same scratch configuration works on Windows.

Why this does not affect small scratch sizes

The default scratch size (DEFAULT_SCRATCH_SIZE = 0x48000 = 288 KB) places scratch at 0xFFFB8000, which is above 0xFEE00000, so there is no overlap.

Reproduction

This can be reproduced with the Nanvix project, which uses Hyperlight as a VMM backend:

  1. Clone the repository and check out branch enhancement-uservm-hyperlight at commit b9c50ed28 (uses Hyperlight rev 4b57b84):

    git clone https://github.com/nanvix/nanvix.git
    cd nanvix
    git checkout enhancement-uservm-hyperlight
  2. Build with the Hyperlight machine target:

    ./z build -- all MACHINE=hyperlight DEPLOYMENT_MODE=standalone
  3. Run the integration test on a machine with KVM and APICv enabled (Intel bare-metal):

    ./bin/mkimage.elf -o nanvix.img \
        "bin/procd.elf;procd" \
        "bin/memd.elf;memd" \
        "bin/testd.elf;testd"
    bash scripts/run-nanvixd.sh hyperlight nanvix.img 120 \
        --wait-for-string "hello, world!"

    On bare-metal Intel with APICv, nanvixd fails immediately with the EEXIST error. On machines with APICv disabled (e.g., WSL2) or on Windows WHP, it succeeds.

    The failing CI run: https://github.com/nanvix/nanvix/actions/runs/24616269717

Proposed solutions

  1. Split the scratch KVM slot around the APIC page: When creating the scratch memory mapping on KVM with hw-interrupts, detect whether [scratch_base, scratch_end] contains 0xFEE00000 and split it into two KVM memory slots: [scratch_base, 0xFEDFFFFF] and [0xFEF00000, scratch_end]. The APIC page itself (0xFEE00000–0xFEEFFFFF) would be left for KVM's internal slot. The host-side mmap backing would remain contiguous; only the KVM slot registration would be split.

  2. Validate scratch_size against known reserved GPAs: In SandboxMemoryLayout::new(), reject scratch sizes that would cause scratch_base_gpa() to fall below 0xFEE00000 when hw-interrupts is enabled on KVM. This would at least provide a clear error message instead of an opaque KvmError(Error(17)).

  3. Document the maximum scratch size constraint: Add a note to SandboxConfiguration::set_scratch_size() and DEFAULT_SCRATCH_SIZE explaining the ~18 MB upper bound on KVM with hw-interrupts.

  4. Disable APICv on the host: Consumers running KVM on Intel can work around this by disabling APICv (loading the module with sudo modprobe kvm_intel enable_apicv=0), which prevents KVM from allocating the APIC access page. Module parameters are applied at load time, so kvm_intel must be unloaded first if it is already loaded. This eliminates the overlap but comes at a performance cost: APIC accesses fall back to VM-exit based emulation instead of hardware-accelerated handling. This is a viable short-term workaround.

Option 1 would be the most flexible, allowing scratch regions of arbitrary size on all platforms. Options 2 and 3 are simpler but limit the usable scratch space. Option 4 is a host-side workaround that does not require any Hyperlight changes.
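To make option 1 more concrete, here is a minimal sketch of the range-splitting step only. It is not Hyperlight code: `GpaRange` and `split_around_apic` are hypothetical names, and the reserved window `0xFEE00000–0xFEEFFFFF` is the one proposed in option 1, not a value confirmed against KVM.

```rust
/// Reserved window left for KVM's internal slot (bounds taken from option 1, inclusive).
const APIC_HOLE_START: u64 = 0xFEE0_0000;
const APIC_HOLE_END: u64 = 0xFEEF_FFFF;

/// Hypothetical inclusive guest-physical range.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct GpaRange {
    start: u64,
    end: u64,
}

/// Split `r` into the parts that lie outside the reserved window.
/// Returns the range unchanged when it does not touch the window.
fn split_around_apic(r: GpaRange) -> Vec<GpaRange> {
    if r.end < APIC_HOLE_START || r.start > APIC_HOLE_END {
        return vec![r];
    }
    let mut parts = Vec::new();
    if r.start < APIC_HOLE_START {
        parts.push(GpaRange { start: r.start, end: APIC_HOLE_START - 1 });
    }
    if r.end > APIC_HOLE_END {
        parts.push(GpaRange { start: APIC_HOLE_END + 1, end: r.end });
    }
    parts
}
```

For the failing configuration, `[0xF977E000, 0xFFFFFFFF]` splits into `[0xF977E000, 0xFEDFFFFF]` and `[0xFEF00000, 0xFFFFFFFF]`. Each part would then be registered as its own KVM slot, with its host address offset by `start - scratch_base` into the contiguous mmap backing; the guest would need to treat the window itself as unusable scratch space.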

Environment

  • Hyperlight revision: 4b57b8416114c489083922afa3dd9716127278fb
  • Features: kvm, hw-interrupts, nanvix-unstable, executable_heap
  • Host: bare-metal Intel x86_64, Linux, KVM with APICv enabled
  • Fails: Intel bare-metal runners (prometheus28, prometheus30, prometheus43)
  • Works: WSL2 (APICv disabled), Windows 11 WHP
