Add xsimd::get<>() for optimized compile-time element extraction #1294

Open
DiamonDinoia wants to merge 2 commits into xtensor-stack:master from DiamonDinoia:feat/optimize-elem-extraction

Conversation

@DiamonDinoia
Contributor

Add a free-function API, xsimd::get<I>(batch), mirroring std::get<I>(tuple), for fast compile-time element extraction from SIMD batches.
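To make the call shape concrete, here is a minimal sketch that does not use xsimd at all. The `toy` namespace, `toy_batch`, and this `get` are hypothetical stand-ins introduced only for illustration; they are not the PR's implementation, which dispatches to per-architecture kernels.

```cpp
#include <array>
#include <cstddef>

namespace toy
{
    // Hypothetical stand-in for a SIMD batch: N lanes of T in plain storage.
    template <class T, std::size_t N>
    struct toy_batch
    {
        std::array<T, N> lanes;
    };

    // Free function taking the lane index as a template parameter,
    // in the style of std::get<I>(tuple). An out-of-range index is
    // rejected at compile time instead of being checked at run time.
    template <std::size_t I, class T, std::size_t N>
    T get(toy_batch<T, N> const& b) noexcept
    {
        static_assert(I < N, "lane index out of range");
        return b.lanes[I];
    }
}
```

Because the index is part of the type-level call, something like `toy::get<4>(b)` on a 4-lane batch fails to compile, which is the property the compile-time API relies on to pick an instruction per index.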

Per-architecture optimized kernel::get overloads using the fastest available intrinsics:

  • SSE2: shuffle/shift + scalar convert
  • SSE4.1: pextrd/pextrq/pextrb/pextrw, bitcast + pextrd for float
  • AVX: vextractf128/vextracti128 + SSE4.1 delegate
  • AVX-512: vextracti64x4/vextractf32x4 + AVX delegate
  • NEON: vgetq_lane_* (single instruction for all types)
  • NEON64: vgetq_lane_f64
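As a rough illustration of the SSE2 "shift + scalar convert" item, the sketch below extracts a 32-bit integer lane with a byte shift followed by a move to a scalar register. The helper name `load_and_get` and the `SKETCH_HAS_SSE2` macro are assumptions made for this example, and the plain-C++ fallback exists only so the snippet compiles on non-x86 targets; none of this is taken from the PR's kernels.

```cpp
#include <cstddef>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Hypothetical helper: load four 32-bit integers and return lane I.
template <std::size_t I>
inline std::int32_t load_and_get(const std::int32_t* p) noexcept
{
    static_assert(I < 4, "lane index out of range");
#if SKETCH_HAS_SSE2
    // Unaligned 128-bit load of the four lanes.
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p));
    // Shift right by I * 4 bytes so lane I lands in lane 0,
    // then move the low 32 bits to a scalar.
    return _mm_cvtsi128_si32(_mm_srli_si128(v, static_cast<int>(I * 4)));
#else
    // Portable fallback, only so the sketch builds without SSE2.
    return p[I];
#endif
}
```

The shift count has to be an immediate, which is why the index being a template parameter matters here: a runtime index could not be passed to the byte-shift intrinsic directly.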

Also fixes a latent bug in the common fallback for complex batch compile-time get (wrong buffer type).

@DiamonDinoia DiamonDinoia force-pushed the feat/optimize-elem-extraction branch 2 times, most recently from 0b6d85f to c6dd311 Compare April 14, 2026 14:38
@DiamonDinoia
Contributor Author

Nice, thanks for fixing CI!

This is ready for review. Once approved I will rewrite the history. I don't want to trigger a useless CI run.

@DiamonDinoia DiamonDinoia marked this pull request as ready for review April 14, 2026 17:27
Comment thread test/test_batch_complex.cpp Outdated
void check_get_all(batch_type const& res, std::index_sequence<Is...>) const
{
int dummy[] = { (check_get_element<Is>(res), 0)... };
(void)dummy;
Contributor

@serge-sans-paille serge-sans-paille Apr 16, 2026


you could check that loading the generated array ends up being equal to res, right?

Contributor

@serge-sans-paille serge-sans-paille left a comment


Please fix the testing so that we have a decent confidence in the getter when index != 0

@DiamonDinoia
Contributor Author

> Please fix the testing so that we have a decent confidence in the getter when index != 0

Yes, I will! I also noticed some small changes I should make. I just have not had time to get to this yet.

@DiamonDinoia DiamonDinoia force-pushed the feat/optimize-elem-extraction branch from 049e9ee to 22f9a1e Compare April 17, 2026 16:54
Adds a new public API `xsimd::get<I>(batch)` that extracts a compile-time
indexed lane from a batch. Unlike the runtime `batch::get(i)`, the index is
a template parameter so each arch can dispatch to the best single-op path.

Design per architecture (objdump-verified, pure -march flags, no reliance
on compiler optimization):

- SSE2: `first` for I==0; 32/64-bit (int, float, double) go through
  `swizzle + first` so the xsimd permute API emits the shuffle; 8/16-bit
  stay on `psrldq + movd` because sse2 swizzle expands to 2 ops for
  broadcast-to-lane-0 (pshuflw/pshufhw + unpck) while srli keeps it at 1.
- SSE4.1: native `pextrb/w/d/q` for integer (1 op); float override removed
  so it falls through to sse2's swizzle path (equivalent 1-op codegen).
- AVX/AVX2: half-extract + delegate to sse4_1 (1 op low half, 2 ops upper
  half — hardware lower bound).
- AVX-512F: `valignd`/`valignq` rotate + extract for float/double — 1 op
  for every I, including upper half (was 2). Integer keeps the extract +
  pextr* split (2 ops, optimal).
- NEON/NEON64: native per-lane `mov`/`umov v.X[I]` (1 op).
- RVV: skip `vslidedown` when I==0.
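The two recurring ideas in the list above, special-casing I==0 and "half-extract + delegate", can be sketched in portable C++ without any intrinsics. Everything in the `sketch` namespace (`narrow`, `wide`, `first`, `get_impl`, `get`) is a hypothetical model introduced for this example: plain array reads stand in for the shuffle and extract instructions, so it shows the compile-time dispatch only, not the actual kernels or their codegen.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

namespace sketch
{
    // Hypothetical 4-lane register and an 8-lane register made of two halves.
    struct narrow { std::array<float, 4> lanes; };
    struct wide { narrow lo, hi; };

    // Cheapest path: read lane 0 directly.
    inline float first(narrow const& v) noexcept { return v.lanes[0]; }

    // I == 0: no lane movement needed, just take the first lane.
    template <std::size_t I>
    inline float get_impl(narrow const& v, std::true_type) noexcept
    {
        return first(v);
    }

    // I != 0: the array read stands in for "move lane I to lane 0, then first".
    template <std::size_t I>
    inline float get_impl(narrow const& v, std::false_type) noexcept
    {
        return v.lanes[I];
    }

    template <std::size_t I>
    inline float get(narrow const& v) noexcept
    {
        static_assert(I < 4, "lane index out of range");
        return get_impl<I>(v, std::integral_constant<bool, I == 0>{});
    }

    // Wide register: choose the half from the index at compile time,
    // then delegate to the narrow overload with the index reduced modulo 4.
    template <std::size_t I>
    inline float get(wide const& v) noexcept
    {
        static_assert(I < 8, "lane index out of range");
        return get<I % 4>(I < 4 ? v.lo : v.hi);
    }
}
```

In this model the low half costs one step (the narrow extract) and the upper half costs two (select half, then extract), which mirrors the op counts quoted for AVX/AVX2 in the list.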

Tests build `array_type { xsimd::get<Is>(res)... }` via pack-initialization,
compare against the reference array, and verify that reloading the extracted
values reproduces the original batch.
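The test strategy described above can be sketched as follows, again with a hypothetical `toy_batch` and `get` in place of the real xsimd types (all names in `sketch_test` are assumptions for this example, not the PR's test code). Expanding `get<Is>(b)...` inside a braced initializer exercises every index, not just index 0, and the reload step checks the round trip.

```cpp
#include <array>
#include <cstddef>
#include <utility>

namespace sketch_test
{
    // Hypothetical stand-in for a batch.
    template <class T, std::size_t N>
    struct toy_batch
    {
        std::array<T, N> lanes;
    };

    // Hypothetical compile-time getter.
    template <std::size_t I, class T, std::size_t N>
    T get(toy_batch<T, N> const& b) noexcept
    {
        static_assert(I < N, "lane index out of range");
        return b.lanes[I];
    }

    // Build an array by pack-expanding get<Is>(b) for every index.
    template <class T, std::size_t N, std::size_t... Is>
    std::array<T, N> extract_all(toy_batch<T, N> const& b, std::index_sequence<Is...>)
    {
        return std::array<T, N>{ { get<Is>(b)... } };
    }

    // Compare against a reference array, then reload the extracted values
    // into a fresh batch and check that it matches the original.
    template <class T, std::size_t N>
    bool round_trips(toy_batch<T, N> const& b, std::array<T, N> const& expected)
    {
        std::array<T, N> extracted = extract_all(b, std::make_index_sequence<N>{});
        toy_batch<T, N> reloaded{ extracted };
        return extracted == expected && reloaded.lanes == b.lanes;
    }
}
```

Compared with the earlier dummy-array expansion that only invoked a per-index check, collecting the results into one array lets a single comparison cover all lanes.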

Verified on sse2, sse4.1, avx2, avx-512 (sde), aarch64 (qemu), rvv (qemu).
@DiamonDinoia DiamonDinoia force-pushed the feat/optimize-elem-extraction branch from 22f9a1e to 2e030f2 Compare April 17, 2026 17:54
