Large scratch regions on KVM with hw-interrupts fail with EEXIST due to APIC access page overlap

## Summary

See https://github.com/nanvix/nanvix/pull/2082 for more context on why this is relevant for Nanvix on Hyperlight.

When a guest configures a large scratch region via `SandboxConfiguration::set_scratch_size()`, `HyperlightVm::new()` fails on KVM with `EEXIST` (`Error(17)`):

```
UpdateRegion(MapMemory(Hypervisor(KvmError(Error(17)))))
```

The root cause is that the scratch memory slot (KVM slot 1) overlaps with an internal KVM memory slot created by `create_irq_chip()` for the LAPIC/APIC access page at GPA `0xFEE00000`.

## Details

### Scratch region placement

Hyperlight places the scratch region at the top of the 32-bit GPA space via `scratch_base_gpa()`:

```rust
// hyperlight_common::layout
pub fn scratch_base_gpa(size: usize) -> u64 {
    (MAX_GPA - size + 1) as u64
}
```

With `MAX_GPA = 0xFFFF_FFFF`, a scratch size of e.g. `0x6882000` (~104 MB) yields:

- `scratch_base = 0xF977E000`
- Scratch KVM slot covers GPA range `[0xF977E000, 0xFFFFFFFF]`

### KVM irqchip APIC access page

When the `hw-interrupts` feature is enabled, `KvmVm::new()` calls `create_irq_chip()`. On Intel hardware with APICv (or on AMD with AVIC), KVM automatically creates an internal **APIC access page** at GPA `0xFEE00000`. This is a non-removable, non-movable memory slot managed internally by KVM.

Since `0xFEE00000` falls inside `[0xF977E000, 0xFFFFFFFF]`, KVM rejects the `set_user_memory_region` call for the scratch slot with `EEXIST` — the two regions overlap.
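
The overlap can be checked with a few lines of arithmetic. The following is a minimal sketch, not Hyperlight code: `scratch_overlaps_apic` and `ranges_overlap` are hypothetical helpers, and `APIC_ACCESS_LEN` assumes the access page is a single 4 KiB page, which the issue text does not state.

```rust
/// Top of the 32-bit GPA space, as given in the issue text.
const MAX_GPA: u64 = 0xFFFF_FFFF;
/// GPA of the KVM APIC access page, as given in the issue text.
const APIC_ACCESS_GPA: u64 = 0xFEE0_0000;
/// ASSUMPTION: the access page is one 4 KiB page.
const APIC_ACCESS_LEN: u64 = 0x1000;

/// Mirrors the quoted `scratch_base_gpa()`, using u64 throughout.
/// Assumes `size <= MAX_GPA + 1`; larger values would underflow.
fn scratch_base_gpa(size: u64) -> u64 {
    MAX_GPA - size + 1
}

/// Two inclusive ranges overlap iff each starts at or before the other's end.
fn ranges_overlap(a_start: u64, a_end: u64, b_start: u64, b_end: u64) -> bool {
    a_start <= b_end && b_start <= a_end
}

/// Hypothetical helper: does `[scratch_base, MAX_GPA]` intersect the APIC access page?
fn scratch_overlaps_apic(size: u64) -> bool {
    let base = scratch_base_gpa(size);
    let apic_end = APIC_ACCESS_GPA + APIC_ACCESS_LEN - 1;
    ranges_overlap(base, MAX_GPA, APIC_ACCESS_GPA, apic_end)
}
```

Under these assumptions, `scratch_overlaps_apic(0x6882000)` is true for the failing configuration, while `scratch_overlaps_apic(0x48000)` is false for the default scratch size.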

### Maximum safe scratch size

Ignoring page alignment and the size of the APIC page itself, the maximum scratch size that avoids the APIC page is approximately:

```
max_scratch = MAX_GPA - 0xFEE00000 = 0x11FFFFF ≈ 18 MB
```

Any `set_scratch_size()` value above ~18 MB will fail on KVM with `hw-interrupts` enabled on Intel (APICv) or AMD (AVIC) hosts.
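
For a page-aligned bound, the reserved window's own extent has to be subtracted as well. The sketch below is illustrative only: `max_scratch_above` is a hypothetical helper, `PAGE_SIZE` assumes 4 KiB pages, and the caller supplies the inclusive end of whatever window is treated as reserved.

```rust
/// Top of the 32-bit GPA space, as given in the issue text.
const MAX_GPA: u64 = 0xFFFF_FFFF;
/// ASSUMPTION: 4 KiB pages.
const PAGE_SIZE: u64 = 0x1000;

/// Hypothetical helper: largest page-aligned scratch size whose base lies
/// strictly above a reserved window ending at `reserved_end` (inclusive).
fn max_scratch_above(reserved_end: u64) -> u64 {
    // First byte after the window, rounded up to a page boundary.
    let first_free = (reserved_end + 1 + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);
    // Everything from `first_free` up to and including MAX_GPA is usable.
    MAX_GPA + 1 - first_free
}
```

If the reserved window is a single 4 KiB page at `0xFEE00000`, this gives `0x11FF000`. If the whole `0xFEE00000–0xFEEFFFFF` range is treated as reserved, it gives `0x1100000` (17 MiB).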

### Why this does not affect Windows WHP

On Windows, `WhpVm::new()` does not create an explicit interrupt controller memory slot at a fixed GPA. The WHP API (`WHvMapGpaRange2`) maps guest physical address ranges independently, and the platform's LAPIC emulation does not reserve a GPA slot that conflicts with user-mapped regions. This is why the same scratch configuration works on Windows.

### Why this does not affect small scratch sizes

The default scratch size (`DEFAULT_SCRATCH_SIZE = 0x48000` = 288 KB) places scratch at `0xFFFB8000`, which is above `0xFEE00000`, so there is no overlap.

## Reproduction

This can be reproduced with the [Nanvix](https://github.com/nanvix/nanvix) project, which uses Hyperlight as a VMM backend:

1. Clone the repository and check out branch `enhancement-uservm-hyperlight` at commit `b9c50ed28` (uses Hyperlight rev `4b57b84`):
   ```bash
   git clone https://github.com/nanvix/nanvix.git
   cd nanvix
   git checkout enhancement-uservm-hyperlight
   ```

2. Build with the Hyperlight machine target:
   ```bash
   ./z build -- all MACHINE=hyperlight DEPLOYMENT_MODE=standalone
   ```

3. Run the integration test on a machine with KVM and APICv enabled (Intel bare-metal):
   ```bash
   ./bin/mkimage.elf -o nanvix.img \
       "bin/procd.elf;procd" \
       "bin/memd.elf;memd" \
       "bin/testd.elf;testd"
   bash scripts/run-nanvixd.sh hyperlight nanvix.img 120 \
       --wait-for-string "hello, world!"
   ```

   On bare-metal Intel with APICv, nanvixd fails immediately with the EEXIST error. On machines with APICv disabled (e.g., WSL2) or on Windows WHP, it succeeds.

   The failing CI run: https://github.com/nanvix/nanvix/actions/runs/24616269717

## Proposed solutions

1. **Split the scratch KVM slot around the APIC page**: When creating the scratch memory mapping on KVM with `hw-interrupts`, detect whether `[scratch_base, scratch_end]` contains `0xFEE00000` and split it into two KVM memory slots: `[scratch_base, 0xFEDFFFFF]` and `[0xFEF00000, scratch_end]`. The APIC page itself (`0xFEE00000–0xFEEFFFFF`) would be left for KVM's internal slot. The host-side mmap backing would remain contiguous; only the KVM slot registration would be split.

2. **Validate scratch_size against known reserved GPAs**: In `SandboxMemoryLayout::new()`, reject scratch sizes that would cause `scratch_base_gpa()` to fall below `0xFEE00000` when `hw-interrupts` is enabled on KVM. This would at least provide a clear error message instead of an opaque `KvmError(Error(17))`.

3. **Document the maximum scratch size constraint**: Add a note to `SandboxConfiguration::set_scratch_size()` and `DEFAULT_SCRATCH_SIZE` explaining the ~18 MB upper bound on KVM with `hw-interrupts`.

4. **Disable APICv on the host**: Consumers running KVM on Intel can work around this by disabling APICv (`sudo modprobe kvm_intel enable_apicv=0`; the parameter only takes effect when the module is loaded, so an already-loaded `kvm_intel` must be unloaded first), which prevents KVM from allocating the APIC access page. This eliminates the overlap at a performance cost: APIC accesses fall back to VM-exit based emulation instead of hardware-accelerated handling. This is a viable short-term workaround.

Option 1 would be the most flexible, allowing scratch regions of arbitrary size on all platforms. Options 2 and 3 are simpler but limit the usable scratch space. Option 4 is a host-side workaround that does not require any Hyperlight changes.
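
The range arithmetic behind option 1 can be sketched as follows. This is a minimal illustration, not a proposed patch: `GpaRange` and `split_around` are hypothetical names, and the actual KVM slot registration is omitted.

```rust
/// Inclusive guest physical address range (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct GpaRange {
    start: u64,
    end: u64,
}

/// Returns the parts of `region` that lie outside `hole` (at most two).
fn split_around(region: GpaRange, hole: GpaRange) -> Vec<GpaRange> {
    // No intersection: the region can be registered as a single slot.
    if hole.end < region.start || hole.start > region.end {
        return vec![region];
    }
    let mut parts = Vec::new();
    // Portion below the hole, if any.
    if region.start < hole.start {
        parts.push(GpaRange { start: region.start, end: hole.start - 1 });
    }
    // Portion above the hole, if any.
    if hole.end < region.end {
        parts.push(GpaRange { start: hole.end + 1, end: region.end });
    }
    parts
}
```

For the failing configuration, splitting `[0xF977E000, 0xFFFFFFFF]` around `[0xFEE00000, 0xFEEFFFFF]` yields `[0xF977E000, 0xFEDFFFFF]` and `[0xFEF00000, 0xFFFFFFFF]`, matching the two slots described in option 1. To keep the host-side backing contiguous, each slot's host address would presumably be the backing base plus `part.start - region.start`.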

## Environment

- Hyperlight revision: `4b57b8416114c489083922afa3dd9716127278fb`
- Features: `kvm`, `hw-interrupts`, `nanvix-unstable`, `executable_heap`
- Host: bare-metal Intel x86_64, Linux, KVM with APICv enabled
- Fails: Intel bare-metal runners (prometheus28, prometheus30, prometheus43)
- Works: WSL2 (APICv disabled), Windows 11 WHP
