[Performance] SIMD Vector512 Coverage Gaps #614

Description

@Nucs

Overview

#579 made the elementwise kernel path adaptive (VectorBits ∈ {128, 256, 512}, detected once at startup), so add/sub/mul/div/comparisons/shift/modf and the unary-math ops already use Vector512 on capable hardware. But the migration stopped there: the reduction, NaN-masking, cast, and matmul subsystems are still hardcoded to Vector256<T> and never widen to Vector512. On AVX-512 hosts (Intel Ice Lake+/Sapphire Rapids, AMD Zen 4/5) these run at half the lane width they could.

This issue tracks completing the adaptive-width migration for the remaining subsystems.

Parent: #579 (Adaptive Vector Width: V128/256/512)

Problem

AVX-512 appears in only 4 files of Backends/Kernels/. Everywhere else the SIMD body is written against Vector256<T> with zero VectorBits references — i.e. it is width-locked at 256 and cannot use a 512-bit register even when one is available.

Survey (Vector256< occurrences, VectorBits-refs = 0 ⇒ width-locked):

| Subsystem | Files (Vector256< count) | What stays 256-bit on AVX-512 |
|---|---|---|
| Axis reductions | Reduction.Axis.Widening (66), .Simd (58), .VarStd (42), .Boolean (8), .NaN (8) | sum/prod/min/max/mean/var/std/any/all + nan-variants along an axis |
| NaN masking | Masking.NaN (44), Masking.VarStd (16), Masking.Boolean (8) | nansum/nanmean/nanvar/nanstd/nanmin/nanmax |
| Flat reductions | Reduction.Boolean (12), Reduction.Arg (4), ILKernelGenerator.Reduction (3) | flat any/all, argmin/argmax, pairwise sum/prod |
| Weighted sum | WeightedSum (14) | np.average |
| Cast / astype | Cast.ToHalf (16), .ToBool (12), .Complex (9), .Half (8, partial 512), .FloatToUInt (5), .FloatNarrow, .FloatWideInt, .IntNarrow, .ShortNarrow, .Subword{Copy,Narrow,Widen} | most astype conversions are AVX2-only (source comments say "AVX2-only … no AVX512") |
| MatMul / dot | SimdMatMul.{cs,Double,Strided}, MatMul (5) | no AVX-512 at all; dot/matmul cap at V256 |

Already adaptive (use VectorBits, no change needed): Binary, Comparison, Shift, Modf, InnerLoop, and the load/store/op emit helpers in DirectILKernelGenerator.cs.

Proposal

Apply the #579 adaptive-width pattern to each width-locked subsystem: drive lane count and load/store/op from VectorBits (via VectorMethodCache) instead of a literal Vector256, with a runtime capability probe + scalar/256 fallback so nothing regresses on non-AVX-512 hosts. Each item is independently shippable; correctness can be verified on a non-AVX-512 dev box via JIT software-emulation of Vector512 (the same technique used to verify the Round/Truncate fix in dde0a0a9 / a0581c6f), with the actual speedup benchmarked on AVX-512 hardware.
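As a rough illustration of the pattern described above, the sketch below dispatches a flat integer sum on a startup-detected width and falls back to 256-bit or scalar code. It is not the NumSharp implementation: `AdaptiveSum`, its `VectorBits` field, and the explicit `vectorBits` parameter are hypothetical names for this example, and it uses the public `System.Runtime.Intrinsics` API directly rather than `VectorMethodCache` or emitted IL. Integer addition is used so the result does not depend on accumulation order.

```csharp
using System;
using System.Runtime.Intrinsics;

static class AdaptiveSum
{
    // Hypothetical stand-in for the VectorBits value that #579 detects once at startup.
    public static readonly int VectorBits =
        Vector512.IsHardwareAccelerated ? 512 :
        Vector256.IsHardwareAccelerated ? 256 : 0;

    public static int Sum(ReadOnlySpan<int> data) => Sum(data, VectorBits);

    // The explicit width parameter lets a test force the 512-bit path on any host.
    public static int Sum(ReadOnlySpan<int> data, int vectorBits)
    {
        int i = 0, total = 0;
        if (vectorBits == 512)
        {
            var acc = Vector512<int>.Zero;
            for (; i + Vector512<int>.Count <= data.Length; i += Vector512<int>.Count)
                acc += Vector512.Create(data.Slice(i, Vector512<int>.Count));
            total = Vector512.Sum(acc);
        }
        else if (vectorBits == 256)
        {
            var acc = Vector256<int>.Zero;
            for (; i + Vector256<int>.Count <= data.Length; i += Vector256<int>.Count)
                acc += Vector256.Create(data.Slice(i, Vector256<int>.Count));
            total = Vector256.Sum(acc);
        }
        // Scalar tail; also the whole loop when no vector width was selected.
        for (; i < data.Length; i++) total += data[i];
        return total;
    }
}
```

Passing `512` explicitly exercises the wide path even where `Vector512.IsHardwareAccelerated` is false, because the `Vector512<T>` API still works there, only slower.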

  • Axis reductions: Reduction.Axis.Simd, .Widening, .VarStd, .Boolean, .NaN (sum/prod/min/max/mean/var/std/any/all along an axis). Largest cluster; the horizontal-reduction tail needs a width-specific lane-fold.
  • NaN-aware masking & reductions: Masking.NaN, Masking.VarStd, Masking.Boolean, Reduction.NaN (nansum/nanmean/nanvar/nanstd/…).
  • Flat reductions: Reduction.Boolean (any/all), Reduction.Arg (argmin/argmax), ILKernelGenerator.Reduction (pairwise sum/prod), WeightedSum (np.average).
  • Cast / astype subsystem: widen the AVX2-only conversion paths. Note that the AVX-512 shuffles differ from AVX2 (e.g. VPERMB requires AVX512VBMI, and widening/narrowing go through VPMOVZX and the VPMOV* truncating moves), so this is the most involved bucket and should be width-probed per conversion.
  • MatMul / dot: give SimdMatMul a Vector512 microkernel (most likely to show a clean win, since matmul is compute-bound).
  • x86-512 binary routing polish: VectorMethodCache.ResolveX86BinaryApi(512, …) returns Avx512F for everything; wire in Avx512BW (byte/word add/sub/min/max) and Avx512DQ (int64 multiply) so those take the x86 fast path instead of the cross-platform fallback. (Correctness-neutral: today it falls back to Vector512.*, which the JIT lowers correctly, so this is purely a performance change.)
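The width-specific lane-fold mentioned for the axis-reduction tail could look roughly like the following: a 512-bit accumulator is halved to 256 bits, then to 128 bits, and the last few lanes are folded in scalar code. `LaneFold.HorizontalMax` is a hypothetical helper written for this sketch, not a name from the codebase; a sum or product fold would substitute the corresponding operator for `Max`.

```csharp
using System;
using System.Runtime.Intrinsics;

static class LaneFold
{
    // Folds a 512-bit max accumulator down to one scalar: 512 -> 256 -> 128 -> scalar.
    public static int HorizontalMax(Vector512<int> acc)
    {
        Vector256<int> v256 = Vector256.Max(acc.GetLower(), acc.GetUpper());
        Vector128<int> v128 = Vector128.Max(v256.GetLower(), v256.GetUpper());

        int result = v128.GetElement(0);
        for (int i = 1; i < Vector128<int>.Count; i++)
            result = Math.Max(result, v128.GetElement(i));
        return result;
    }
}
```

A 256-bit build of the same kernel would start at the second step, which is why the tail has to be selected per width rather than shared.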

Evidence

  • Width-lock confirmed by grep: the files above contain Vector256<…> with 0 VectorBits references, so the emitted IL pins a 256-bit register regardless of VectorBits.
  • Reduction.Axis.Simd.cs: AxisReductionSimdHelper<T> is generic over T but fixed at Vector256<T>.
  • Cast.ToHalf.cs header documents the AVX2-only design ("i64/u64 → f16 (Wave 17, AVX2-only)…").
  • SimdMatMul.* contains no Avx512/Vector512 references.
  • AVX-512 currently used only in: Cast.cs, Cast.Half.cs, DirectILKernelGenerator.cs, VectorMethodCache.cs.

Scope / Non-goals

  • Non-goal: the elementwise path (already covered by #579, "[SIMD] Adaptive Vector Width: Support Vector128/256/512 Based on Hardware").
  • Non-goal: new ufuncs or dtype support — this is width widening of existing kernels only, bit-for-bit identical results.
  • Non-goal: AVX-512 sub-feature detection policy beyond what each kernel needs (BW/DQ/VBMI probed where used).
  • Out of scope here: scalar-math unary ops (sin/cos/exp/log) that have no SIMD at any width.

Benchmark / Performance

  • Target: up to 2× throughput on the widened kernels on AVX-512 hosts; no regression on V256/V128/scalar hosts (capability-gated fallback).
  • Caveat: reductions and casts are frequently memory-bound, and heavy AVX-512 can trigger frequency throttling on some Intel parts — so the win is real but hardware-dependent and must be measured on Zen 4+/Ice Lake+ before claiming it. Use benchmark/layout/ (reduction × layout × dtype) and benchmark/cast/ (astype src→dst matrix) to A/B V256 vs V512.
  • Dev-box correctness without AVX-512: force VectorBits = 512 semantics and rely on the JIT's software emulation of Vector512<T> to bit-compare against the V256/scalar result.
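A minimal sketch of the dev-box bit-comparison idea follows, assuming an elementwise square root as the kernel under test (chosen only for illustration; the issue itself does not name it). `WidthBitCompare` and its methods are hypothetical. The `Vector512<T>` calls run on hosts without AVX-512 because the runtime falls back to narrower or scalar code, which is what makes the comparison possible off-target.

```csharp
using System;
using System.Runtime.Intrinsics;

static class WidthBitCompare
{
    // Candidate kernel written against Vector512<float>.
    public static void SqrtV512(ReadOnlySpan<float> src, Span<float> dst)
    {
        int i = 0;
        for (; i + Vector512<float>.Count <= src.Length; i += Vector512<float>.Count)
        {
            Vector512<float> v = Vector512.Create(src.Slice(i, Vector512<float>.Count));
            Vector512.Sqrt(v).CopyTo(dst.Slice(i));
        }
        for (; i < src.Length; i++) dst[i] = MathF.Sqrt(src[i]);
    }

    // Scalar reference result.
    public static void SqrtScalar(ReadOnlySpan<float> src, Span<float> dst)
    {
        for (int i = 0; i < src.Length; i++) dst[i] = MathF.Sqrt(src[i]);
    }

    // Index of the first element whose raw bit patterns differ, or -1 if all match.
    public static int FirstMismatch(ReadOnlySpan<float> a, ReadOnlySpan<float> b)
    {
        if (a.Length != b.Length) return 0;
        for (int i = 0; i < a.Length; i++)
            if (BitConverter.SingleToInt32Bits(a[i]) != BitConverter.SingleToInt32Bits(b[i]))
                return i;
        return -1;
    }
}
```

Comparing raw bits rather than using `==` keeps signed zeros and NaN payloads visible, which a value comparison would hide.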

Labels: core, enhancement, performance
