perf: add native AVX2 uint64/int64 mul kernel#1306

Merged
DiamonDinoia merged 1 commit into xtensor-stack:master from DiamonDinoia:fix/avx2-uint64-mul on Apr 16, 2026
@DiamonDinoia
Contributor

Previously, mul on batch<[u]int64_t, avx2> fell through to the AVX kernel, which has no integer multiply, and from there to SSE4.1. That path splits each 256-bit register into two 128-bit halves (vextracti128/vinserti128) and runs the mul_epu32 sequence twice.

Add a sizeof(T)==8 specialization that uses _mm256_mul_epu32 directly, mirroring the SSE4.1 pattern with 256-bit intrinsics. It generates 8 ymm ops (2 vpshufd, 3 vpmuludq, 2 vpaddq, 1 vpsllq) with no lane splitting.
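
For readers unfamiliar with the trick, here is a minimal sketch of how a 64-bit lane-wise multiply can be assembled from 32x32->64 multiplies. It is not the code from this PR: the function names (mul_u64_via_32bit_parts, mul_u64x4_avx2, mul_u64x4) are hypothetical and chosen for illustration, and only the intrinsics themselves are real. The scalar function models a single lane; the AVX2 function applies the same decomposition to four lanes and should map onto the instruction mix listed above, though the exact codegen depends on the compiler. Because the low 64 bits of a product are identical for signed and unsigned operands, the same sequence can serve int64_t and uint64_t.

```cpp
#include <cstdint>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Scalar model of one 64-bit lane (hypothetical helper, for illustration).
// a * b mod 2^64 = a_lo*b_lo + ((a_hi*b_lo + a_lo*b_hi) << 32);
// the a_hi*b_hi term is shifted out entirely and can be dropped.
inline std::uint64_t mul_u64_via_32bit_parts(std::uint64_t a, std::uint64_t b)
{
    const std::uint64_t a_lo = a & 0xFFFFFFFFu, a_hi = a >> 32;
    const std::uint64_t b_lo = b & 0xFFFFFFFFu, b_hi = b >> 32;
    const std::uint64_t low = a_lo * b_lo;
    // The sum may wrap, but only its low 32 bits survive the shift below.
    const std::uint64_t cross = a_hi * b_lo + a_lo * b_hi;
    return low + (cross << 32);
}

#if defined(__AVX2__)
// Four lanes at once with 256-bit intrinsics (hypothetical name).
inline __m256i mul_u64x4_avx2(__m256i a, __m256i b)
{
    // vpshufd: swap the 32-bit halves inside each 64-bit lane so the high
    // halves land in the even positions that vpmuludq reads.
    const __m256i a_swapped = _mm256_shuffle_epi32(a, _MM_SHUFFLE(2, 3, 0, 1));
    const __m256i b_swapped = _mm256_shuffle_epi32(b, _MM_SHUFFLE(2, 3, 0, 1));
    const __m256i low = _mm256_mul_epu32(a, b);          // a_lo * b_lo
    const __m256i cross = _mm256_add_epi64(
        _mm256_mul_epu32(a_swapped, b),                  // a_hi * b_lo
        _mm256_mul_epu32(a, b_swapped));                 // a_lo * b_hi
    return _mm256_add_epi64(low, _mm256_slli_epi64(cross, 32));
}
#endif

// Multiply four uint64 values element-wise. Uses the AVX2 path when the
// translation unit is compiled with AVX2 enabled, otherwise the scalar model.
inline void mul_u64x4(const std::uint64_t* a, const std::uint64_t* b,
                      std::uint64_t* out)
{
#if defined(__AVX2__)
    const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), mul_u64x4_avx2(va, vb));
#else
    for (int i = 0; i < 4; ++i)
        out[i] = mul_u64_via_32bit_parts(a[i], b[i]);
#endif
}
```

Built with -mavx2 (or an equivalent flag), mul_u64x4 takes the vector path; without it, the scalar fallback keeps the sketch portable so both paths can be compared against a plain a * b.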

AVX512F (without DQ) also benefits since it forwards to the AVX2 kernel.

@DiamonDinoia
Contributor Author

@serge-sans-paille this is a small one :)

@serge-sans-paille
Contributor

yep, looks good!

@serge-sans-paille
Contributor

You should have the right to merge now, feel free to do so once green.

@DiamonDinoia DiamonDinoia merged commit d05d46f into xtensor-stack:master Apr 16, 2026
73 checks passed