feat: NumPy 2.x parity — byte export, file I/O, dtype resolution & np.rint (+ all-15-dtype fuzz)#616

Open
Nucs wants to merge 24 commits into master from development-journey1

Nucs (Member) commented Jul 2, 2026

Overview

Splits the non-drawing NumPy 2.x parity work off the nditer line onto a clean branch based on master. 24 commits covering byte export / casting, file I/O, dtype resolution, the new np.rint ufunc, printing fixes, and a large differential-fuzz expansion. The 5 imaging commits (NumSharp.Drawing*) are intentionally excluded and remain on nditer.

What's included

Byte export & casting

  • tobytes(order) — full ndarray.tobytes C/F/A/K parity
  • (breaking) dissolve legacy NDArray.ToByteArray; tobytes is the sole NumPy byte export
  • ndarray.astype(casting=) gate — closes the last type-conversion parity gap
  • ToByteArray() returns logical C-order bytes; GetData<T>/ToArray<T> dtype-mismatch semantics pinned

File I/O

  • np.fromfile — full parity: count / offset / sep (text) / default-dtype / Stream
  • np.tofile(sep, format) — full ndarray.tofile parity + C-order fix

Dtype resolution

  • Char promotes/computes as the uint16 masquerade (6 Char bugs + collected)
  • int8/float16 avoidance across the type-resolution surface
  • np.min_scalar_type — stop avoiding int8/float16
  • normalize non-0/1 bool bytes everywhere bool is used numerically (reductions/casts)

Math

  • np.rint — round-half-to-even ufunc, reusing UnaryOp.Round (float-tier promotion, out=/where=)

Printing

  • tofile / scalar-str parity: #-flag decimal point + float32/16 scientific threshold

Testing — differential fuzz (NumPy oracle)

  • grids widened toward all 15 dtypes; Char woven in via the uint16 proxy
  • independent C# Decimal oracle: unary/binary/reduce/scan/power/var/std/matmul/astype/stat/where/sort/manip
  • Group A: ~30 array→array np.* ops wired into the oracle grids (5 bugs found, pinned under [OpenBugs])
  • NEP50 out-of-range python-scalar gap reproduced

Docs / benchmark

  • website 15-dtype table, namespace overwrites, TOC links, benchmark coach page
  • 2026-06-29 benchmark history snapshot (recovers bitwise after an intermittent AccessViolation)
  • UnmanagedMemoryBlock(T*, long) ownership remark fix

Breaking changes

| Change | Impact | Migration |
| --- | --- | --- |
| NDArray.ToByteArray dissolved | callers of ToByteArray() | use tobytes() (optionally tobytes(order)) |

Verification

  • Branch = master + 24 cherry-picked commits. git diff <merge-base>..master -- src/ is empty, so the picks sit on exactly their authored src/ baseline (no divergence risk).
  • Solution builds clean: 0 errors.
  • One binary conflict during cherry-pick (docs/website-src/images/benchmark-dashboard.png, add/add) resolved in favor of master's curated "scoreboard" image; the textual benchmarks-dashboard.md rework was applied.

Nucs added 24 commits July 1, 2026 19:55
Removed "Deepak Kumar Battini" from the <Authors> metadata in
src/NumSharp.Core/NumSharp.Core.csproj, which feeds the NuGet package
author attribution shown on nuget.org and in the generated .nupkg.
The Authors field now reads: "Eli Belash, Haiping Chen, Meinrad Recheis".

Scope notes for future reference:
- The @deepakkumar1984 GitHub-handle references in docs/issues/
  (issue #362 archive + categories.md) were intentionally left as-is:
  they are factual attributions of who filed/commented on that real
  GitHub issue, not a self-authored project credit.
- The "Deepak Kumar Gouda" entry in src/numpy/doc/changelog/ is a
  different person inside the third-party NumPy reference clone and is
  out of scope.
…tes parity)

ToByteArray() copied Storage.InternalArray.BytesLength bytes from Storage.Address
— the RAW underlying buffer — ignoring strides, offset and broadcasting. A
non-contiguous view therefore leaked the wrong data and the wrong length: e.g.
arange(10)[::2] (logical [0,2,4,6,8], 20 bytes) returned the full 40-byte parent
buffer [0..9]; a transpose returned F-order bytes; a broadcast view returned only
the unstretched source.

ToByteArray() now mirrors numpy.ndarray.tobytes(): it returns the LOGICAL array in
C (row-major) order, always exactly size*dtypesize bytes. A pristine C-contiguous
offset-0 buffer (Shape.IsSliced == false) keeps the fast single-memcpy path; any
sliced/strided/transposed/broadcast/offset view is first materialized via copy('C').
No internal callers depended on the old raw-buffer behavior; the legacy
pristine-contiguous contract (and its existing test) is unchanged.
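
The raw-buffer versus logical-bytes difference described above can be sketched with the Python standard library alone. This is an illustration under assumptions, not the NumSharp code: a `memoryview` stands in for a strided view over a parent buffer, and the variable names are made up for the example.

```python
import struct

# Parent buffer: ten native 32-bit ints 0..9, playing the role of arange(10).
parent = struct.pack("10i", *range(10))

# Strided view over the parent: logical elements [0, 2, 4, 6, 8].
view = memoryview(parent).cast("i")[::2]

raw_bytes = parent               # what a raw-buffer export would hand back
logical_bytes = view.tobytes()   # C-order readout of the logical elements only

print(len(raw_bytes), len(logical_bytes))
```

The logical readout is half the size of the parent buffer and contains only the selected elements, which is the contract the commit describes for non-contiguous views.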

Tests (31 new; full CI-style suite 11197 passed / 0 failed / 11 skipped):
- Casting/NDArray.ToByteArray.Test.cs: every view type (contiguous, prefix,
  middle-slice, strided, reversed, transpose, column, broadcast, 3D, 0-d, empty)
  across all 15 dtypes; exact NumPy-verified bytes; frombuffer round-trips;
  detached-copy semantics.
- Interop/NumpyByteContractTests.cs: NumSharp<->NumPy 2.4.2 byte contract —
  endianness (big-endian string dtype byteswaps; NPTypeCode path is little-endian),
  NaN-payload / signaling-NaN / subnormal / -0 bit-exact preservation
  (Half/Single/Double/Complex), uint64 above int64.max, complex part layout,
  Char<->uint16 code units incl. surrogate pairs, Decimal having no numpy dtype,
  and bool non-0/1-byte storage/truthiness/reduction semantics.
…twise after intermittent AccessViolation

Full official NumSharp-vs-NumPy run on branch nditer @ 2d16f477 (i9-13900K,
.NET 10.0.101 / Release, NumPy 2.4.2). Runs the op/dtype/N matrix (14 suites ×
15 dtypes × 1K/100K/10M = 1851 cells) plus the five appended subsystems
(NDIter, Layout, Operand, Cast, Fusion). Repoints benchmark/history/latest ->
2026-06-29_2d16f477.

Headline (NPY/NS, >1 = NumSharp faster):
  1,000      1.13x   (116 / 58 / 35 / 11)
  100,000    0.98x   (295 / 135 / 124 / 35)
  10,000,000 1.39x   (427 / 136 / 24 / 4)
  Overall: 1851 ops | ✅ 838 | 🟡 329 | 🟠 183 | 🔴 50 | ▫ 388 | ⚪ 63
  NDIter 1.20x geomean · Cast 1450 win / 118 lag.

Bitwise recovery
----------------
The original full run (~10782s) hit the known intermittent AccessViolation
mid-run, inside the in-process Bitwise suite (BitwiseBenchmarks.LeftShift ->
SimdScalarShiftDispatch<int> -> IL_ShiftLeft_Scalar_Int32). Because the official
config runs InProcessEmit (one process per suite), that single fault took the
whole suite down and dropped all 81 bitwise op×dtype×N cells (⚪ jumped 69→225).

The bitwise suite was re-run from the same HEAD (clean, no source change); it
completed without fault (the bug is rare), and its BenchmarkDotNet JSON was
merged back into the op matrix. The NDIter+Layout+Operand+Cast+Fusion sections
were spliced back verbatim. Bitwise cells now carry data (⚪ 225→63), e.g.
np.left_shift int32 @10m = 1.67x.

Investigation (filed #615)
-------------------------------------------
The shift/bitwise kernels are the VICTIM, not the cause. ~13M faithful ops did
not reproduce the fault, and the built-in guard-page detector (NUMSHARP_GUARD_PAGES=1)
found ZERO out-of-bounds across the entire bitwise surface (setup casts + all six
ops × all integer dtypes × all three sizes). AccessViolation (unmapped memory) +
no OOB ⇒ an unmanaged-storage LIFETIME race (use-after-free under GC pressure),
not an overrun — most likely a raw .Address pointer handed to an emitted kernel
without a GC.KeepAlive, or a pool/finalizer free-list race. Full diagnosis,
repro harness, and next steps in issue #615.

Provenance: MANIFEST records clean HEAD (benchmarked code unchanged). The bitwise
timings were measured in a separate process from the same commit on the same quiet
machine — identical methodology to the orchestrator's per-suite isolation.
…s used numerically

A boolean's numeric value is exactly 0 or 1 (NumPy: every nonzero counts as 1), but a
bool buffer can legally hold non-0/1 bytes: np.frombuffer returns a zero-copy VIEW (like
NumPy), and framework interop (Numpy.NET, P/Invoke, mmap, network) wraps foreign buffers.
NumSharp aliased bool to uint8 storage in several SIMD/scalar fast paths and
accumulated/reinterpreted the RAW byte (e.g. byte 255 contributed 255), diverging from
NumPy across sum-family reductions and narrow casts.

Four independent paths read the raw byte instead of normalizing nonzero->1:

  1. Flat sum/prod/mean  - DirectILKernelGenerator.cs::EmitConvertTo widened the raw byte
     for a Boolean source. Now normalizes (ldc.i4.0; cgt.un) before widening, mirroring the
     existing to==Boolean '!= 0'. Covers flat reductions and scans.
  2. Axis sum/prod/mean  - Reduction.Axis.Widening.cs aliased (Boolean,*) to the
     byte-widening SIMD kernel. Removed the 3 bool entries so bool falls through to the
     scalar reducer (CombineScalarsPromoted -> ConvertToInt64Bits/ConvertToDouble), which
     already normalizes via '!= 0'.
  3. Flat var/std        - Default.Reduction.Var.cs::VarMomentsRealDispatch read bool as
     byte. Added VarMomentsBool (reads byte != 0 ? 1.0 : 0.0).
  4. Narrow casts astype(bool->i8/u8/i16/u16/f16) - the SIMD subword/widen/xToHalf cast
     kernels reinterpreted the raw byte (bool->wide already used the scalar normalizing
     path). TryGetCastKernel/TryGetStridedCastKernel now return null for a Boolean source,
     routing every bool cast to NDIterCasting.ConvertValue (normalizes via ReadAsInt64
     'bool ? 1 : 0'). Also makes the dispatch match its own documented contract.

Before/after (bytes [0,1,2,3,255,0,127,128] -> logical [F,T,T,T,T,F,T,T]):
  sum 516->6, mean 64.5->0.75, var 8033.75->0.1875, std 89.63->0.4330127
  2D sum total 10->4, axis0 [4,6,0]->[2,2,0], axis1 [3,7]->[2,2]
  astype(u8) [0,1,2,3,255,..]->[0,1,1,1,1,0,1,1]

Min/max/any/all/argmin/argmax over bool were already correct (result casts back to bool,
so nonzero->True is preserved); proper 0/1 bool buffers are unaffected (normalization is
idempotent).
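
The before/after numbers above can be re-derived in a few lines of plain Python. This is only a sketch of the arithmetic (raw-byte reading versus nonzero->1 normalization); the helper name `stats` is hypothetical and none of it is NumSharp code.

```python
raw = bytes([0, 1, 2, 3, 255, 0, 127, 128])

def stats(values):
    """Return (sum, mean, population variance) of a list of numbers."""
    n = len(values)
    total = sum(values)
    mean = total / n
    var = sum((v - mean) ** 2 for v in values) / n   # ddof=0
    return total, mean, var

# Reading the raw byte as the numeric value (the old behaviour).
raw_stats = stats(list(raw))

# Normalizing every nonzero byte to 1 before accumulating (the fixed behaviour).
bool_stats = stats([1 if b != 0 else 0 for b in raw])

print(raw_stats)
print(bool_stats)
```

Both tuples line up with the sum, mean and var figures quoted in the before/after block.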

Verification (NumPy 2.4.2 as oracle):
  - New regression suite test/NumSharp.UnitTest/BoolNonBinaryReductionTests.cs: 19 tests
    (flat+axis sum/prod/mean/var/std, 5 narrow + 2 wide casts, 0/1 guard), green on
    net8.0 and net10.0.
  - Differential probes: 38/38 and 30/30 ops bit-equal to NumPy.
  - No regression: full suite 11166 pass / 0 fail; FuzzMatrix 69/69 corpora; 1290
    reduction tests.

Trade-off: bool->{i8,u8,i16,u16,f16} casts lose their SIMD subword fast path (now scalar).
A perf-preserving alternative is a one-pass SIMD clamp min(byte,1) before the existing
kernel; deferred.
…r bugs + collected)

Char is NumSharp's representation of a 16-bit unsigned integer (System.Char /
UTF-16 code unit) with no direct NumPy dtype — its only sound analogue is uint16,
so Char MUST promote and compute bit-identically to uint16 everywhere. A
differential-fuzz Char gate (generate every op as uint16 in NumPy 2.4.2, relabel
uint16->char, assert bit-identical) surfaced six places where it did not. All
expected values below were probed against NumPy 2.4.2 as uint16.

Bug 1 — promote(Char, *) ranked Char in group 0 / priority 0 (with uint8/Byte),
  and the static arr_arr / arr_scalar promotion tables carried wrong Char rows
  (e.g. (char,uint8)->uint8). char[321] + uint8[65] returned dtype Byte / value
  130 (386 truncated to a byte). Fixed every Char entry to mirror uint16's rank:
    arr_arr:    (char,uint8)->char, (char,int8/int16)->int32, (char,float16)->float32
                and the symmetric (uint8/int8/int16/float16, char) entries.
    arr_scalar: char ARRAY wins over every integer/bool scalar (-> char);
                (char,float16) scalar -> float32.
  char + uint8 now yields a 2-byte-unsigned result (value 386), not a truncated byte.

Bug 1b — the same mis-rank computed comparisons at the truncated width:
  greater(char 321, uint8 65) returned False and equal returned True (both
  operands collapsed to 65). Fixed by the same table change (comparisons resolve
  their compare-width through result_type) -> greater True, equal False.

Bug 2 — reciprocal(char) returned Double because IsInteger() deliberately excludes
  Char, so it fell to the float loop. NumPy reciprocal(uint16) takes the integer
  loop (1//x, 0 for |x|>=2). Admitted Char to the integer branch and added a Char
  case to ReciprocalInteger (preserves Char, narrow-type 1/0 -> 0 sentinel).

Bug 3 — power(char, float32) returned Double. Root cause was a power-promotion
  override `lhsGroup<=2 && rhsGroup==3 -> float64` that fired for EVERY int-base **
  float-exp. That is a NumPy-1.x rule NEP50 removed: result_type already computes
  power promotion (uint16**f32->f32, int32**f32->f64, weak-int**f_arr->f_arr).
  Removed the override. Also fixes the broader pre-existing bug where
  {bool,int8,int16,uint8,uint16} ** float32 wrongly upcast to float64.

Bug 4 — power crashed on a char scalar exponent: the scalar-exponent fast path
  read it via Convert.ToDouble(char), whose IConvertible.ToDouble throws
  InvalidCastException. Read the char code point directly (mirrors the existing
  Half special-case). power(uint16, 2) now returns uint16 [9,16].

Bug 5 — (Boolean, Char) was missing from the arr_scalar table, so a bool array op
  a char SCALAR threw KeyNotFoundException '(Boolean, Char)'. Not bitwise-specific
  — any binary op hit it (add too). Added (bool,char)->char.

Bug 6 — invert(char) with N >= SIMD width threw NotSupportedException: the
  BitwiseNot SIMD path emitted Vector<char>, which the BCL does not have
  (CanUseSimd(Char) is already false for the same reason). Excluded Char from the
  BitwiseNot SIMD eligibility gate; the scalar path computes ~x bit-exactly.
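
The truncation behind Bug 1 and Bug 1b can be reproduced with plain integer masking. A minimal sketch, assuming an 8-bit mask models the mis-ranked promotion and a 16-bit mask models the uint16 masquerade; the helper names are illustrative only.

```python
def to_uint8(x):
    return x & 0xFF      # what happens when the result dtype is Byte

def to_uint16(x):
    return x & 0xFFFF    # what the uint16 masquerade requires

char_val, byte_val = 321, 65

wrong_sum = to_uint8(char_val + byte_val)    # result computed at byte width
right_sum = to_uint16(char_val + byte_val)   # result computed at 16-bit width

# Comparisons evaluated at the truncated width collapse 321 to 65.
wrong_greater = to_uint8(char_val) > to_uint8(byte_val)
wrong_equal = to_uint8(char_val) == to_uint8(byte_val)
right_greater = to_uint16(char_val) > to_uint16(byte_val)
```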

Collected along the way:
  - AsNumpyDtypeName(Char) reported "uint8" (Char is 2 bytes) -> now "uint16";
    graduated the [OpenBugs] T1_33 audit test that documented it.
  - The power-override removal also corrects the non-Char int**float32 promotions.
  - bool-array + char-scalar failed for ALL binary ops, not just bitwise.

Verification:
  - Char≡uint16 differential in NumSharp (uint16 is NumPy-validated):
    496 binary/unary op × dtype × order combos, 11 memory layouts
    (strided/reversed/transposed/broadcast/2D/axis), 36 astype combos — 0 diverged.
  - New CharUInt16MasqueradeTests.cs (26 tests) pins all six bugs with NumPy-probed
    values plus arr_arr/arr_scalar table differentials and memory-layout coverage.
  - Updated find_common_type Case23 (char+int16: Int16 -> Int32, matching uint16).
  - Full suite green: 11224 passed / 0 failed (net10.0, CI filter); FuzzMatrix
    NumPy-oracle gate green; net8.0 + net10.0 compile clean.
…cted during Char work)

Collected while making Char promote as the uint16 masquerade — the Char≡uint16
differential vs NumPy 2.4.2 surfaced it, but it is NOT Char-specific: it affects
every integer dtype.

NumPy 2.x (NEP50) folds a Python int operand into the array's dtype but FIRST
range-checks it, raising OverflowError before any element-wise work when the value
is not representable:
  np.array([1,2],uint16) + 70000   -> OverflowError (70000 > 65535)
  np.array([1,2],uint16) * -1      -> OverflowError (-1 < 0, unsigned)
  np.array([1],int8)    + 200      -> OverflowError (200 > 127)
  np.power(np.array([2,3],uint16),-1) -> OverflowError (-1 out of uint16 range)
In-range scalars whose RESULT overflows wrap fine (uint16[1,2]-5 -> [65532,65533]).

NumSharp promotes a C# scalar purely TYPE-based (the arr_scalar table), so it
silently coerces/wraps the value instead of range-checking it:
  uint16[1,2] + 70000 -> [4465,4466]   uint16[1,2] * -1 -> [65535,65534]
  int8[1] + 200 -> wraps               power(uint16[2,3],-1) -> [0,43691]
The power case is doubly wrong: no throw AND a nonsense modular inverse (43691 == 3^-1 mod 2^16).

Inconsistency proving this is a gap, not a design choice: NumSharp's OWN fused path
already enforces it — np.evaluate throws the exact OverflowException
(NDExpr.Typing.cs "Python integer {value} out of bounds for {dtype}"). Only the
operator / ufunc path skips the check.
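
A rough sketch of the range check described above, in plain Python. The function names are hypothetical and the message only imitates the one quoted from NDExpr.Typing.cs; the point is the order of operations (range-check the scalar first, wrap the result afterwards).

```python
def fold_python_int(value, bits, signed):
    """Range-check a Python int against an integer dtype before any element-wise work."""
    lo = -(1 << (bits - 1)) if signed else 0
    hi = (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
    if not lo <= value <= hi:
        kind = "int" if signed else "uint"
        raise OverflowError(f"Python integer {value} out of bounds for {kind}{bits}")
    return value

def wrap_unsigned(value, bits):
    """Modular wrap of a result into an unsigned dtype of the given width."""
    return value & ((1 << bits) - 1)

def rejects(value, bits, signed):
    """True when the scalar fails the range check."""
    try:
        fold_python_int(value, bits, signed)
    except OverflowError:
        return True
    return False

# In-range scalar, overflowing result: wraps (uint16 1 - 5).
wrapped = wrap_unsigned(1 - fold_python_int(5, 16, signed=False), 16)

# Type-based coercion without the check silently wraps the scalar instead.
silent = wrap_unsigned(1 + 70000, 16)
```

The same arithmetic confirms the modular-inverse remark: `(3 * 43691) % 65536` is 1.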

Adds 5 [OpenBugs] reproductions (asserting the correct OverflowError; failing today,
excluded from CI via TestCategory!=OpenBugs). Remove [OpenBugs] when the operator/
ufunc weak-scalar path range-checks the value like np.evaluate already does.
… (Char woven, Decimal oracle)

Extend the NumPy differential-fuzz corpora to cover the full NumSharp dtype matrix
across every op, instead of the 13 NumPy-representable dtypes on a curated subset of
ops. The two NumPy-orphan dtypes (Char, Decimal) are now first-class grid members, and
the per-mode dtype axes are widened toward ALL_DTYPES. Net corpus growth: the op corpus
went 35.5K -> 53.3K cases (+16.3K, ~46%) plus 234 new Decimal cases; committed corpus
~68K. The full FuzzMatrix gate is green (72/72), CI-style.

WHAT CHANGED

1. Char WOVEN into every tier (was excluded — no NumPy char dtype).
   - gen_oracle.char_tier(mode): generates each Char op through the uint16 NumPy proxy
     and relabels uint16->char (bytes intact), appending the cases into the SAME tier
     file as their native kin (binary_arith/unary/bitwise/comparison/reduce/scan/stat/
     manip/sort/tail/astype_full). The existing per-tier [FuzzMatrix] tests now assert
     NumSharp's Char === uint16. 3,726 Char cases woven, all green.
   - Replaces the earlier bolt-on `char` mode + char.jsonl + FuzzCorpusTests.Char.

2. Native dtype subsets widened to ALL_DTYPES where the op is type-general:
   SORT/STAT/SCAN/MANIP/TAIL/PARAM/CNZ/NAN_REDUCE/LOGIC/ALIAS/COPYTO (CLIP keeps every
   ordered dtype; TAIL/ALIAS exclude bool — gen subtracts, NumPy bans bool `-`).

3. Decimal differential coverage via an INDEPENDENT C# oracle (no NumPy analog):
   - test/oracle/gen_decimal_oracle.cs computes every expected value with naive scalar
     System.Decimal arithmetic (NOT NumSharp kernels) and emits the harness JSONL schema
     -> decimal_{unary,binary,reduce,scan}.jsonl (234 cases: 4 unary, 8 binary arith +
     6 comparison, 5 reductions, 2 scans, over 13 single + 7 pair layouts).
   - BitDiff tokenizes Decimal by canonical VALUE (scale-insensitive: 1.0m === 1.00m),
     since there is no NumPy decimal scale to match. ALL GREEN — no decimal kernel bug.
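
The scale-insensitive comparison can be illustrated with Python's `decimal` module as a stand-in for System.Decimal (an analogy, not the BitDiff implementation). `canonical_token` is a hypothetical helper name.

```python
from decimal import Decimal

def canonical_token(d):
    """Token that depends on the numeric value only, not on the stored scale."""
    return str(d.normalize())   # normalize() strips trailing zeros

a, b = Decimal("1.0"), Decimal("1.00")

same_text = str(a) == str(b)                           # textual form keeps the scale
same_token = canonical_token(a) == canonical_token(b)  # value-based token ignores it
```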

HARNESS
   - FuzzCorpus.DtypeToTC: map "char" and "decimal".
   - OpRegistry.ApplyArgsort: wire Boolean/Decimal/Complex (were harness gaps, now green).
   - FuzzCorpusTests: 4 Decimal* tiers; Char-woven note.

BUGS FOUND (carved out of the green corpus, reproduced under [OpenBugs], CI-excluded;
remove the carve + test when fixed):
   OpenBugs.Char.cs (OpenBugsCharTests):
     - promote(Char,Byte) -> Byte: Char ranks below uint8 in the promotion table, so
       Char x uint8 truncates the Char's high byte. Corrupts arithmetic (char[321]+uint8[65]
       -> Byte 130, not uint16 386) AND comparisons (greater(321,65) -> False, equal -> True).
     - reciprocal(char) -> Double (should be uint16/char integer reciprocal).
     - power(char, float32) -> Double (should be float32; Char mis-ranked above float32).
     - power(char, ...) scalar path -> InvalidCastException (kernel calls Convert.To*(char)).
     - bitwise_*(bool, char) -> KeyNotFoundException '(Boolean, Char)' (unregistered kernel).
     - invert(char) N>=16 -> NotSupportedException (Vector256<ushort> path omits Char;
       N<=15 scalar path works).
   OpenBugs.DtypeCoverage.cs (OpenBugsDtypeCoverageTests):
     - clip(bool) on non-contiguous (strided/transposed/F-contig) -> NotSupportedException
       (contiguous path handles bool; the general clip kernel omits it).

NON-BUGS (classified, documented carves — NOT OpenBugs):
   - complex self-multiply ULP: NumSharp matches NumPy's SCALAR z*z exactly; NumPy's own
     array ufunc disagrees by ~1 ULP on a catastrophic-cancellation _cbase value. Carved
     complex from ALIAS (ill-conditioned input, not a bug).
   - argsort<bool>/<Complex> "not supported" was an OpRegistry wiring gap, now fixed -> green.

Docs: Fuzz/README.md ledger + .claude/CLAUDE.md updated (corpus counts, char_tier,
gen_decimal_oracle.cs, value-aware decimal compare, [OpenBugs] dispositions).
… green)

Round out the independent C# Decimal oracle (gen_decimal_oracle.cs) to the
decimal-specific ops where bugs are most likely to hide. 68 new cases, all green —
NumSharp's Decimal kernels are bit-correct (value-wise) for every covered op.

  - power(decimal, int-exponent): exact repeated multiply / reciprocal oracle. 20 cases.
  - var (axis=None, ddof=0): mean((x-mean)^2), exact decimal arithmetic. 11 cases.
  - std (axis=None, ddof=0): sqrt(var) oracled by an INDEPENDENT Newton-Raphson decimal
    sqrt (NOT NumSharp's DecimalMath.Sqrt) — both converge to the same value to full
    decimal precision, validating DecimalMath.Sqrt. 11 cases.
  - matmul 2D@2D (incl. one F-contiguous B): naive triple-loop sum-of-products oracle
    (decimal + is exact, so accumulation order is irrelevant). 4 cases.
  - astype decimal->{int32,int64,float32,float64} (truncation toward zero) and
    {int32,int64,float64}->decimal: the cast kernel. 22 cases.
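
The idea of oracling sqrt with an independent Newton-Raphson iteration can be sketched in Python's `decimal` module. This is an analogy under assumptions: Python's `Decimal` is not System.Decimal, `newton_sqrt` is a hypothetical helper, and `Decimal.sqrt()` plays the role of the implementation under test.

```python
from decimal import Decimal, localcontext

def newton_sqrt(x, prec=28):
    """Square root of a non-negative Decimal via Newton-Raphson."""
    if x < 0:
        raise ValueError("sqrt of a negative value")
    if x == 0:
        return Decimal(0)
    with localcontext() as ctx:
        ctx.prec = prec + 10                  # guard digits during iteration
        guess = x if x > 1 else Decimal(1)    # start at or above the root
        for _ in range(200):                  # cap guards against last-digit oscillation
            nxt = (guess + x / guess) / 2
            if nxt == guess:
                break
            guess = nxt
        ctx.prec = prec
        return +guess                         # unary plus rounds to the target precision

oracle = newton_sqrt(Decimal(2))
under_test = Decimal(2).sqrt()
print(oracle == under_test)
```

Two independently derived values agreeing to full precision is the evidence the commit relies on; neither side is trusted on its own.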

Decimal now spans 8 tiers / 302 cases (unary/binary/reduce/scan/power/varstd/matmul/
astype) over 13 single + 7 pair layouts. Full FuzzMatrix gate green (76/76).
…rity

Extend ToByteArray with an order parameter and add the NumPy-named tobytes()
alias, mirroring numpy.ndarray.tobytes(order='C'). Previously ToByteArray()
only emitted logical C-order bytes with no order control.

Order semantics (all probed against NumPy 2.4.2, src/numpy methods.c +
convert.c PyArray_ToString):
  'C'/'c' -> row-major
  'F'/'f' -> column-major
  'A'/'a' -> 'F' iff F-contiguous AND not C-contiguous, else 'C'
            (exactly NumPy's PyArray_ISFORTRAN test; reused via OrderResolver)
  'K'/'k' -> ALWAYS 'C' for the numeric dtypes NumSharp supports. This is a
            real NumPy quirk: tobytes routes non-reference dtypes through
            CopyInto into a C-contiguous destination (the F flag is only set
            for order=='F'), so tobytes('K') never preserves an F-contiguous
            source. Deliberately special-cased instead of OrderResolver's 'K'
            (which keeps an F source as 'F' for copy/flatten).
  invalid -> ArgumentException via OrderResolver (NumSharp's house mapping of
            NumPy's ValueError).
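
The order table above amounts to a small decision function. A sketch in Python, assuming the two contiguity flags are already known; `resolve_tobytes_order` is a hypothetical name and this is not OrderResolver itself.

```python
def resolve_tobytes_order(order, c_contiguous, f_contiguous):
    """Map a tobytes order argument to the physical order actually emitted."""
    key = order.upper()                 # 'c'/'f'/'a'/'k' are accepted too
    if key in ("C", "F"):
        return key
    if key == "A":
        # 'F' only when the source is F-contiguous and NOT C-contiguous.
        return "F" if (f_contiguous and not c_contiguous) else "C"
    if key == "K":
        # The quirk described above: tobytes('K') never preserves F order.
        return "C"
    raise ValueError(f"invalid order {order!r}")   # NumSharp maps this to ArgumentException
```

For example, a 1-D contiguous array is both C- and F-contiguous, so 'A' resolves to 'C' for it.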

Implementation (no per-dtype switch; NDIter-driven):
  - Fast path: direct Buffer.MemoryCopy from Storage.Address when the view is
    already contiguous in the requested physical order AND offset==0. The
    offset==0 guard is required — simple contiguous slices are re-based (offset
    folded into Address) but strided/negative-stride/F-sliced views keep their
    start in Shape.offset, so Address would point at the wrong first element
    (verified: reversed [::-1] off=4, F-sliced T[...,1:] off=2).
  - Otherwise materialize via copy(physical) (NDIter copy primitive; absorbs
    scalar/(1,)/strided/broadcast/transposed uniformly across all 15 dtypes),
    then memcpy. copy('F') leaves a buffer whose linear bytes are the
    column-major readout.
  - GC.AllocateUninitializedArray: the buffer is fully overwritten by the copy,
    so the CLR zero-fill is pure waste — matches NumPy's uninitialized
    PyBytes_FromStringAndSize(NULL,n).

Validation:
  - 936/936 cases bit-exact vs NumPy oracle (13 comparable dtypes x 18 layouts
    x 4 orders: contig/F-contig/asfortran/strided/reversed/transpose/column/
    row/submat/broadcast/3D/scalar/empty/prefix/offset/1-elt/negcol/3D-transpose).
  - Char + Decimal (no NumPy equivalent) validated oracle-free via the
    metamorphic identity tobytes('F') == transpose(reversed-axes).tobytes('C').
  - 14 new MSTest cases (exact NumPy bytes + all-15-dtype rules + lowercase +
    invalid-order + empty/scalar/roundtrip/detached-copy); 17 existing
    ToByteArray tests still green on net8.0 + net10.0.
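
The metamorphic identity used for Char and Decimal can be checked on nested lists without any oracle. A minimal 2-D sketch (for 2-D, reversing the axes is a plain transpose); the helper names are illustrative and elements stand in for byte groups.

```python
def c_order(rows):
    """Row-major readout of a 2-D nested list."""
    return [x for row in rows for x in row]

def f_order(rows):
    """Column-major readout of a 2-D nested list."""
    n_rows, n_cols = len(rows), len(rows[0])
    return [rows[r][c] for c in range(n_cols) for r in range(n_rows)]

def transpose(rows):
    return [list(col) for col in zip(*rows)]

a = [[1, 2, 3],
     [4, 5, 6]]

lhs = f_order(a)               # plays the role of tobytes('F')
rhs = c_order(transpose(a))    # plays the role of transpose(...).tobytes('C')
```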

Performance (Release, warm best-of-21 vs NumPy 2.4.2, NPY/NS):
  contiguous fast path  2.9x-6.4x faster (10M-50M, i4/f8) — single memcpy +
    uninitialized alloc + .NET LOH page reuse (CPython returns large buffers
    to the OS and re-faults each call).
  materialize <=10M     1.6x-2.1x faster (strided/transpose).
  materialize 50M+      ~0.5x: the unmanaged->managed 2nd pass costs an extra
    out-of-cache DRAM sweep vs NumPy's single CopyInto-into-bytes; rare for
    tobytes (usually contiguous). Single-pass would need a per-dtype
    instantiation (against the no-dtype-switch rule) or byte-reinterpret
    plumbing; deferred.
…oracle grids (5 bugs found)

Close the highest-value gaps from the COVERAGE_GAPS.md audit: array→array ops that fit the
differential-oracle design but were never wired. 18 ops added across Batches 1-3; the
op corpus grows by ~1.7K cases; full FuzzMatrix gate green (77/77). Five real bugs caught,
carved from the green corpus and reproduced under [OpenBugs].

ADDED (green):
  Batch 1 (logic mode): logical_and/or/xor (binary→bool), logical_not (unary→bool),
    arctan2 (binary trig→float).
  Batch 2: sort (value sort, axis -1/0/1 → sort.jsonl), diagonal (matmul tier), ediff1d
    (scan tier), nanpercentile/nanquantile (nanreduce tier, finite+NaN data), round_/around
    (new rounding.jsonl, decimals 0/1/2).
  Batch 3: flatnonzero, argwhere (sort tier), allclose, array_equal (logic tier, 0-D bool),
    unique (sort tier, contiguous+finite).

BUGS FOUND → [OpenBugs] (OpenBugsDtypeCoverageTests):
  - trace(unsigned) → Int64 instead of uint64 (accumulator upcasts to the signed default;
    cf. sum(uint8)→uint64 which is correct). Trace_Unsigned_WrongResultDtype.
  - round_/around with NEGATIVE decimals: int loop THROWS ArgumentOutOfRangeException
    (System.Math.Round rejects digits<0), float mis-rounds. Round_NegativeDecimals_Broken.
  - round_ on float16 with decimals>=1 diverges from NumPy banker's rounding.
    Round_Float16_Fractional_Diverges.
  - iscomplex / isreal IGNORE the imaginary part (complex → all-real) and emit garbage on
    strided real input. IsComplex_IgnoresImaginaryPart / IsReal_IgnoresImaginaryPart.

CLASSIFIED NON-BUGS (carved, documented — not OpenBugs):
  - nanpercentile/nanquantile across inf: percentile INTERPOLATION over inf is ill-defined
    (inf-inf=NaN) — gave them finite+NaN data so they test the actual nan-skipping.
  - unique on raw-offset corpus views = the documented '#11 unreachable-via-API'
    representation gap (public-API unique is correct, verified); inf/NaN complex ordering is
    implementation-defined — gave unique contiguous+finite data instead.

Tracking doc: test/NumSharp.UnitTest/Fuzz/COVERAGE_GAPS.md (full np.* audit + Group A work list;
Batches 1-3 done, remaining: take/put/compress/extract, ravel_multi_index/unravel_index/indices,
convolve, flatten/rollaxis/append/insert, the split family, the decimal-specific ops).
…plit/index), all green

Complete the np.* half of Group A. New `groupa` tier (103 cases) exercises the shape,
selection, convolution, multi-output split, and index-transform ops that fit the oracle
but were unwired. All GREEN — no bugs. Full FuzzMatrix gate 78/78.

ADDED (green, groupa tier):
  shape:      flatten (C-order copy, incl. non-contiguous source), rollaxis, append, insert
  selection:  take (int64 indices, axis), compress (bool cond), extract (bool mask), put (mutate)
  math:       convolve (full/same/valid, 1-D)
  multi-out:  split / hsplit / vsplit / dsplit (one case per output piece)
  index:      ravel_multi_index (coords->flat), unravel_index (flat->coords, per-dim piece)

np.* Group A is now complete: 33 ops wired across Batches 1-6 (7 -> [OpenBugs], the rest green).
Only `indices` (creation-shaped, no operand -> Group C) and the decimal-specific ops
(extend gen_decimal_oracle.cs) remain. Tracking: Fuzz/COVERAGE_GAPS.md.
…rity)

Investigating the audit item "np.array(byte[]) -> int8": that path is already
correct (C# byte -> uint8/Byte, sbyte -> int8/SByte across every np.array overload,
and AsNumpyDtypeName maps them right). The real, closely-related defect is in
np.min_scalar_type, which falsely assumed "NumSharp doesn't have int8/sbyte" and
never returned float16 (Half) — both types that DO exist among the 15 dtypes.

Bugs fixed (verified against NumPy 2.4.2 convert_datatype.c:min_scalar_type_num):

1. Negative int8-range scalars widened to Int16 instead of SByte (int8).
   min_scalar_type(-1..-128) returned Int16; NumPy returns int8. Affected every
   signed input (sbyte/short/int/long) since they route through MinTypeForSignedInt.
   The sbyte arm (sb>=0 ? Byte : Int16) had the same widening.

2. Float/double values NEVER returned Half (float16). The old code used
   exact-float32-representability, which both (a) missed float16 entirely and
   (b) diverged from NumPy, e.g. min_scalar_type(0.1) returned Double. NumPy demotes
   floats by RANGE (magnitude), allowing precision loss:
     |v| < 65000 (or non-finite) -> float16 ; |v| < 3.4e38 -> float32 ; else float64.
   Bounds are exclusive and copied verbatim from NumPy (65000, not 65504).

3. Added explicit Half input -> Half (NumPy NPY_HALF: float16 stays float16).
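
The range-based float demotion in item 2 can be written down directly. A sketch in Python using the cutoffs quoted above; `min_float_type` is a hypothetical name and the function returns dtype names as strings rather than real dtypes.

```python
import math

def min_float_type(v):
    """Smallest float dtype name by magnitude, with exclusive bounds."""
    if not math.isfinite(v) or abs(v) < 65000:
        return "float16"      # non-finite values also demote to float16
    if abs(v) < 3.4e38:
        return "float32"
    return "float64"
```

Note that 0.1 lands in float16 even though it is not exactly representable there: the rule is about range, not precision.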

promote_types/result_type were checked and are NOT affected (they promote existing
types, not value-based smallest-type inference) — 11/11 narrow-type pairs match NumPy.

Validation:
- New NumPy-oracle differential over 122 cases (all 15-dtype boundaries incl. int8
  negatives, float16 range, exclusive cutoffs) — 0 divergences.
- Corrected two tests that pinned the old buggy behavior (min_scalar_type(-1)->Int16,
  1.0f->Single, NaN/Inf->Single, -10->Int16) to NumPy truth, per project DOD.
- Added np.array C#-type -> NumPy-dtype-name regression (byte[] -> uint8 never int8,
  sbyte[] -> int8, full 12-type matrix) to lock the originally-reported item.
- Full suite green: 11259 passed / 0 failed (net10.0); affected classes green on net8.0.
… C-order fix

Rewrite NDArray.tofile from a binary-only, single-string-overload stub into the
full NumPy ndarray.tofile(fid, /, sep='', format='%s'). Adds a Stream overload,
text mode (sep + Python %-format), and — critically — fixes a data-corruption bug.

BUG FIX (binary mode, non-contiguous arrays):
  The old tofile blindly wrote this.Array.Address for this.Array.BytesLength — the
  RAW underlying buffer. A sliced/strided/transposed/broadcast view therefore leaked
  the wrong bytes and wrong length (e.g. arange(10)[::2] wrote all 10 elements instead
  of the 5 logical ones). NumPy guarantees "Data is always written in C order,
  independent of the order of a". Binary mode now writes the logical C-order bytes:
  contiguous offset-0 views stream straight from Storage.Address (no managed copy);
  every other layout is materialized C-contiguous via copy('C') first — mirroring
  NumPy's PyArray_ToFile (whole-buffer fwrite when ISCONTIGUOUS, else a C-order walk).

API parity (probed against NumPy 2.4.2 methods.c + convert.c):
  - tofile(string fid, sep='', format='%s')  — creates/truncates the file ("wb").
  - tofile(Stream, sep='', format='%s')       — writes at the current position, leaves
    the stream open (the file-object form).
  - sep=="" (default) => binary; sep!="" => text, C-order iterate, "format % item"
    joined by sep with NO trailing separator.
  - format default "%s" (== "") => the NumPy scalar string via ArrayFormatter.ScalarStr
    (proven == str(np.scalar): floats keep ".0", 1e+20, complex "(1+2j)"/"1j", etc.).
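
The text-mode rule ("format % item" joined by sep, no trailing separator) is easy to sketch in Python. This is an illustration only: `tofile_text` is a hypothetical helper, it returns a string instead of writing a file, and Python's `str` stands in for ArrayFormatter.ScalarStr under the default "%s".

```python
def tofile_text(items, sep, fmt="%s"):
    """Render items the way tofile's text mode is described above."""
    if sep == "":
        raise ValueError("sep == '' selects binary mode, which this sketch does not model")
    # The one-element tuple keeps '%' from unpacking an item that is itself a tuple.
    return sep.join(fmt % (item,) for item in items)

default_text = tofile_text([1.0, 2.5, 3.0], ",")
fixed_text = tofile_text([1.0, 2.5, 3.0], " ", "%.3f")
int_text = tofile_text([1.0, 2.5, 3.0], ";", "%d")
```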

New PrintfFormatter (Backends/Printing): a focused port of CPython str.__mod__ as used
by tofile — conversions d i u f F e E g G s x X o c %, flags - + space 0 #, width and
.precision. Python semantics reproduced (probed): float truncation under %d, 2-digit-min
lowercase/uppercase exponent for %e/%E, C-style %g with trailing-zero stripping, signed
%x/%X/%o with "#" prefixing even zero ("%#x" % 0 == "0x0"), Python's zero-padding of
inf/nan under the 0 flag ("%08.2f" % inf == "00000inf"), complex->real for real convs.
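
A few of the Python %-format behaviours listed above, written as checks against CPython itself (the reference being ported, not PrintfFormatter). The case list is a small hand-picked sample, not the 446-case matrix.

```python
# (format, value, expected CPython output)
cases = [
    ("%d", 3.9, "3"),                        # float truncation under %d
    ("%e", 12345.678, "1.234568e+04"),       # 2-digit-minimum exponent
    ("%E", 0.00012, "1.200000E-04"),
    ("%g", 100000.0, "100000"),              # trailing zeros stripped
    ("%g", 1000000.0, "1e+06"),              # switches to exponent form
    ("%x", -255, "-ff"),                     # signed hex
    ("%#x", 0, "0x0"),                       # '#' prefixes even zero
    ("%08.2f", float("inf"), "00000inf"),    # zero-padding applies to inf
]

results = [(fmt, fmt % (value,), expected) for fmt, value, expected in cases]
mismatches = [r for r in results if r[1] != r[2]]
print(mismatches)
```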

Drive-by fix (ArrayFormatter.PythonComplexRepr): str() of a pure-imaginary complex with
a +0.0 real part force-prepended a "+" — str(0j) gave "+0j" (and 1j would give "+1j").
Now matches CPython: pure-imaginary form carries the imaginary's own sign ("0j","1j",
"-1j","-0j"); a -0.0 real still takes the parenthesized form ("(-0+0j)"). This is a
pre-existing 0-d complex scalar-str bug that also surfaced through tofile text mode.
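
The CPython forms being matched can be reproduced with plain `str()`. The values are built with `complex(real, imag)` so the sign of each zero is explicit; this only demonstrates Python's output, not the ArrayFormatter code.

```python
cases = [
    complex(0.0, 0.0),    # 0j
    complex(0.0, 1.0),    # 1j
    complex(0.0, -1.0),   # -1j
    complex(0.0, -0.0),   # -0j
    complex(-0.0, 0.0),   # (-0+0j)  a -0.0 real part forces the parenthesized form
    complex(1.0, 2.0),    # (1+2j)
]
rendered = [str(z) for z in cases]
print(rendered)
```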

Validation:
  - 558/558 tofile cases bit/byte-exact vs NumPy (13 dtypes x 6 layouts x binary + text
    %s/%d/%.3f/%e/%g/%08.2f/%+.1f).
  - PrintfFormatter: 446/446 format-x-value cases match Python (float64 + int64 across
    the conversion/flag/width/precision matrix, incl. inf/nan and round-half-to-even).
  - Char + Decimal (no NumPy analog) self-consistent (binary == ToByteArray('C');
    text = per-element ScalarStr).
  - 16 new MSTest cases; full CI suite green (11275 passed) incl. 204 printing/ToString
    parity tests (the complex fix is consistent with the ~18K-case printing port).

Performance (Release, warm best-of-N vs NumPy 2.4.2, NPY/NS):
  binary contiguous 10M   2.48x   binary strided 10M  10.84x (NumPy per-element fwrite)
  text %.4f 200K          1.92x   int text            ~5x    float %s text  1.37x
  Float-%s text is the one sub-1.5x case: both libraries are bound by the same
  shortest-float (Dragon4) algorithm; batched writer flushing keeps the rest of the
  per-element overhead off the hot path.
…oat32/16 sci threshold

Two NumPy-parity bugs surfaced by differential fuzzing np.tofile against NumPy 2.4.2
(binary + text, ~2500 format-x-value-x-dtype cases). Both live in the shared printing
layer, so they also affected 0-d scalar str, not just tofile.

1) PrintfFormatter '#' (alternate) flag dropped the decimal point on f/e/g.
   Python keeps a decimal point under '#' even when no fractional digits are emitted:
     "%#.0f" % 3      -> "3."        (was "3")
     "%#.0e" % 3      -> "3.e+00"    (was "3e+00")
     "%#g"   % 100000 -> "100000."   (was "100000")
   Fix: thread `alt` into FormatFixed / FormatScientific and insert the point when the
   mantissa/number has none; FormatGeneral already suppressed trailing-zero stripping
   under '#' but must likewise force the point. 84/738 stress cases were wrong; now 738/738
   match Python (incl. %g power-of-10 carry and %.Nf round-half-to-even, already correct).
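
   The three cases above, next to their non-'#' counterparts, as evaluated by CPython itself (a reference check only, not the PrintfFormatter code):

```python
pairs = [
    ("%.0f", "%#.0f", 3),
    ("%.0e", "%#.0e", 3),
    ("%g", "%#g", 100000),
    ("%g", "%#g", 1.5),
]
results = [(plain % v, alt % v) for plain, alt, v in pairs]
for plain_out, alt_out in results:
    print(repr(plain_out), repr(alt_out))
# '3'      '3.'
# '3e+00'  '3.e+00'
# '100000' '100000.'
# '1.5'    '1.50000'   ('#' also keeps trailing zeros under %g)
```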

2) ArrayFormatter.PythonFloatRepr used float64's positional/scientific window for ALL
   float dtypes. NumPy's scalar repr (scalartypes.c.src @name@type_@kind@_either) is
   positional iff 1e-4 <= |value| < max_positional, where max_positional is PER DTYPE
   (float16 -> 1e3, float32 -> 1e6, float64 -> 1e16) and the cutoff is tested on the VALUE:
     str(np.float32(1e15))  -> "1e+15"     (was "1000000000000000.0")
     str(np.float32(1e-4))  -> "1e-04"     (was "0.0001"; float32(1e-4) rounds to 9.999e-5 < 1e-4)
     str(np.float16(65500)) -> "6.55e+04"  (was "65500.0")
   For float64 the new value-based test is exactly equivalent to the old decExp in [-4,16)
   check, so double output (and the ~18K-case array-print fuzz) is unchanged; only the
   float32/float16 0-d scalar / tofile-%s path is tightened to its true window. This was a
   latent printing-port bug: the array-printing path uses FloatingFormat (its own exp logic),
   so the scalar-only PythonFloatRepr escaped the earlier fuzz. 865/865 f16/f32/f64 scalar-str
   fuzz cases now match NumPy.
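
   A minimal sketch of the per-dtype window, assuming only the thresholds quoted above. The names `is_positional` and `round_to_dtype` are hypothetical, `struct` stands in for the dtype rounding, and zero / non-finite values are outside this sketch.

```python
import struct

MIN_POSITIONAL = 1e-4
MAX_POSITIONAL = {"float16": 1e3, "float32": 1e6, "float64": 1e16}
STRUCT_CODE = {"float16": "e", "float32": "f", "float64": "d"}

def round_to_dtype(value, dtype):
    """Round a Python float to the nearest value representable in dtype."""
    code = STRUCT_CODE[dtype]
    return struct.unpack(code, struct.pack(code, value))[0]

def is_positional(value, dtype):
    """True when the scalar would print positionally, False for scientific."""
    v = abs(round_to_dtype(value, dtype))   # the cutoff is tested on the VALUE
    return MIN_POSITIONAL <= v < MAX_POSITIONAL[dtype]

print(is_positional(1e15, "float64"))    # True
print(is_positional(1e15, "float32"))    # False
print(is_positional(1e-4, "float64"))    # True
print(is_positional(1e-4, "float32"))    # False: float32(1e-4) is below 1e-4
print(is_positional(65500, "float16"))   # False
```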

Also documents PrintfFormatter's deliberate leniency where CPython/NumPy RAISE (%x/%o on
bool/float, %c on float, %d on inf/nan, conversion-count != 1): a per-element file writer
returns a best-effort rendering instead of aborting mid-stream. These remain the only
differential-fuzz divergences and are all cases where `format % item` throws in Python
(and NumPy's own tofile even segfaults on %d-with-inf).

Tests: 2 new regression cases (# flag, float32/16 sci threshold at both the tofile and 0-d
ToString surfaces). Full CI suite green (11277 passed) incl. 204 printing/ToString parity tests.
… (NumPy parity)

Follow-up to the np.min_scalar_type fix: swept the whole dtype-resolution surface
with full differential matrices against NumPy 2.4.2 and fixed every sibling of the
"narrow type (int8/float16) gets avoided/widened" bug. Probes run: promote_types
13x13 (169), can_cast 13x13x5 casting modes (845), issubdtype concrete x abstract
(117), reductions+unary per dtype (300), binary ufuncs 12x12x7 (1008). promote_types
and the reduction/unary-math tiers were already correct; the bugs below were not.

1) NPTypeHierarchy._concreteParent was MISSING BOTH narrow types.
   SByte (int8) and Half (float16) had no entry, so GetImmediateKind() returned
   Generic for them — breaking every consumer:
     - issubdtype(int8, signedinteger/integer/number) -> False   (NumPy: True)
     - issubdtype(float16, floating/inexact/number)   -> False   (NumPy: True)
     - can_cast same_kind involving int8/float16 (see #2)
     - maximum_sctype(int8)/(float16) -> themselves   (now int64/float64)
   Added [SByte]=SignedInteger and [Half]=Floating (+ GetMaximumType arms).
   issubdtype differential: 117/117 after (int8/float16 rows now pass).

2) can_cast(..., "same_kind") used a SYMMETRIC "same category" model, diverging
   from NumPy's DIRECTIONAL kind ordering (dtype_kind_to_ordering in
   convert_datatype.c): bool(0) < unsigned(1) < signed(2) < float(4) < complex(5),
   allowed iff a safe cast OR KindOrder(from) <= KindOrder(to). 32/845 cases were
   wrong, e.g.:
     - int16 -> int8   was False (NumPy True: signed -> signed)
     - int16 -> uint8  was True  (NumPy False: signed -> unsigned is DOWN a kind)
     - int32 -> float  was False (NumPy True: int -> float)
     - float64 -> float16 allowed (float -> float); complex -> float rejected (DOWN a kind).
   Added NPTypeHierarchy.KindOrder + CanCastSameKindOrder; can_cast now ORs safe
   with the ordering. can_cast differential: 845/845 after. (Kept the symmetric
   IsSameKind for its own callers/tests; it is no longer used by can_cast.)
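
   The directional rule can be sketched as follows, using the ordering values quoted above. `same_kind_ok` and the kind labels are hypothetical names for illustration; the safe-cast check is passed in as a flag rather than modelled.

```python
# Ordering values as quoted from dtype_kind_to_ordering.
KIND_ORDER = {"bool": 0, "unsigned": 1, "signed": 2, "float": 4, "complex": 5}

def same_kind_ok(src_kind, dst_kind, is_safe=False):
    """same_kind: allowed iff the cast is safe OR it does not move DOWN a kind."""
    return is_safe or KIND_ORDER[src_kind] <= KIND_ORDER[dst_kind]

print(same_kind_ok("signed", "signed"))     # True   int16 -> int8
print(same_kind_ok("signed", "unsigned"))   # False  int16 -> uint8
print(same_kind_ok("signed", "float"))      # True   int32 -> float
print(same_kind_ok("complex", "float"))     # False  complex -> float
```

   A symmetric "same category" test would answer True for both directions of a pair or for neither, which is why it could not reproduce the signed -> unsigned rejection.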

3) Bool-input ufuncs with NO bool loop returned bool instead of int8. NumPy's
   power/floor_divide/remainder/square loops start at int8, so bool operands
   promote (probed 2.4.2, and already the documented rule for np.evaluate — the
   DIRECT ops just never applied it):
     - power(bool,bool)        -> int8 [True**True=1, False**False=1]
     - floor_divide(bool,bool) -> int8 [1//1=1, 0//0=0]
     - mod(bool,bool)          -> int8 [1%1=0]
     - square(bool)            -> int8 [True->1, False->0]
   Extended the existing shift-op bool->int8 remap in ExecuteBinaryOp to cover
   FloorDivide/Mod/Power, fixed ResolvePowerResultType (power's scalar-exponent
   fast paths), and Default.Square. Scoped strictly to resultType==Boolean (i.e.
   both operands bool) so add/multiply/etc. are untouched. binary differential:
   1008/1008 after; values verified against NumPy.

Also corrected two tests that pinned the OLD buggy same_kind behavior
(can_cast(int32, float32, "same_kind") asserted False; NumPy says True) — they
contradicted other tests in the tree that already expected the correct direction.

Validation:
 - Differentials vs NumPy 2.4.2 all green after fixes: promote 169, can_cast 845,
   issubdtype 117, binary 1008, reductions+unary (only np.rint diverges — a MISSING
   ufunc in NumSharp, not a dtype bug; noted, out of scope).
 - New Casting/NarrowDtypeParityTests.cs (9) pins issubdtype/maximum_sctype/
   can_cast-directional + the four bool->int8 op cases (dtype AND values).
 - Full suite 11286 passed / 0 failed (net10.0), FuzzMatrix gate 79/79 bit-exact,
   affected classes green on net8.0.
…rt/manip (Group A complete)

Closes the final Group A item: extend the INDEPENDENT C# Decimal oracle
(gen_decimal_oracle.cs) to the remaining decimal-supported ops. Decimal is the one
NumSharp numeric dtype with no NumPy analog, so the generator itself is the oracle —
every expected value is computed with naive scalar System.Decimal math (no NumSharp
kernels), then the harness replays the operand through NumSharp's decimal KERNELS and
value-compares (BitDiff tokenizes decimal by canonical value, so 1.0m ≡ 1.00m).

New / extended tiers (12 total, 579 cases, all green):
  - decimal_unary  (+floor/ceil/trunc via decimal.Floor/Ceiling/Truncate — exact base-10)
  - decimal_scan   (+diff n=1,2 along the last axis; DiffAxis oracle)
  - decimal_stat   (NEW, 170): clip = Max(lo,Min(hi,x)); order stats median/ptp/
                    percentile/quantile (axis=None -> scalar). Oracle = naive sort +
                    NumPy 'linear' interpolation in EXACT decimal (Quantile/Median).
  - decimal_where  (NEW, 4): where(cond,a,b) 16-byte conditional-copy over contig+strided
  - decimal_sort   (NEW, 7): sort along an axis (1-D/2-D, contig+strided; SortAxis oracle)
  - decimal_manip  (NEW, 36): ravel/transpose/reshape — value-preserving reindex forcing
                    the strided decimal materialize/copy path (compared C-contiguous)
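
The order-statistic oracle idea (naive sort plus 'linear' interpolation in decimal arithmetic) can be sketched with Python's `decimal` module. This is not gen_decimal_oracle.cs; the function names are hypothetical, and Python's Decimal context stands in for System.Decimal.

```python
from decimal import Decimal

def quantile_linear(values, q):
    """Naive quantile: sort, then interpolate linearly between neighbours."""
    s = sorted(values)
    pos = q * (len(s) - 1)          # fractional index into the sorted data
    lo = int(pos)                   # floor, since pos is non-negative
    hi = min(lo + 1, len(s) - 1)
    frac = pos - lo
    return s[lo] + (s[hi] - s[lo]) * frac

def median(values):
    return quantile_linear(values, Decimal("0.5"))

def ptp(values):
    return max(values) - min(values)

def clip(x, lo, hi):
    return max(lo, min(hi, x))

data = [Decimal("4"), Decimal("1"), Decimal("3"), Decimal("2")]
print(median(data))                             # 2.5
print(quantile_linear(data, Decimal("0.25")))   # 1.75
print(ptp(data))                                # 3
print(clip(Decimal("7"), Decimal("0"), Decimal("5")))  # 5
```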

Every oracle formula was validated bit-identical against NumSharp's decimal path BEFORE
generating the corpus (median even/odd n, ptp, quantile/percentile q in {0,.25,.5,.75,1},
clip, sort, diff, floor/ceil/trunc, where, reshape/ravel/transpose) — de-risked, no
false divergences. Zero harness changes: all these ops are already dtype-generic in
OpRegistry, so the decimal cases flow through the existing dispatch.

nan* reductions intentionally skipped: System.Decimal cannot represent NaN, so
nansum/nanmax/... are byte-identical to plain sum/max/... (already covered by
decimal_reduce; verified np.nansum(decimal) == np.sum(decimal)).

Note: the shared n++ case-ID counter shifts IDs in the pre-existing decimal tiers
(binary/reduce/power/varstd/matmul/astype); those diffs are ID-relabel ONLY —
operand/expected buffers are unchanged (verified by id-stripped diff).

Gate: FuzzMatrix 65/65 green on net10.0 (4 new decimal test methods:
DecimalStat/DecimalWhere/DecimalSort/DecimalManip). COVERAGE_GAPS.md: Group A closed.

(cherry picked from commit 48ebfa4fcc2ce57bcedcdfc14ba75108bef845c8)
…nversion parity gap

Audited the ENTIRE type-conversion surface against NumPy 2.4.2 and found the value
side fully on-parity; the one gap was a missing API parameter, closed here.

Audit (all differential vs NumPy 2.4.2, bit-exact):
 - astype 13 NumPy dtypes, 13x13, FRESH independent oracle (aggressive edges distinct
   from the committed corpus: 0.4999/0.5/0.5001 boundaries, subnormals, 1e300, all int
   overflow points, NaN/±inf, complex) — 169/169. (Plus the committed astype_full corpus
   5070 + FuzzMatrix, green.)
 - Char (uint16 masquerade): char<->X byte-identical to uint16<->X — 23/23.
 - Decimal->X (all 13 numeric dst incl. modular int overflow + float16 Inf): matches the
   NumPy-verified double path — 442/442.
 - X->Decimal: int exact + round-trip; float exact in-range.
 - np.array(dtype=) / copyto vs astype — 12/12.
 - can_cast (np.can_cast AND the separate NDIterCasting.CanCast used by copyto/ufuncs):
   each 845/845 vs NumPy across 13x13x5 casting modes.

The gap: ndarray.astype had no `casting=` parameter — it always cast unsafely. NumPy's
astype takes casting='unsafe' (default) and raises TypeError when a stricter rule
('no'/'equiv'/'safe'/'same_kind') forbids the cast.

Fix: added `string casting = "unsafe"` to both full astype overloads
(NPTypeCode and Type). Default 'unsafe' is a NO-OP short-circuit — 100% backward
compatible, every existing caller is unchanged. A stricter rule validates through the
hardened np.can_cast (bit-exact vs NumPy across all 15 dtypes) and raises
InvalidCastException — same exception type and message shape as np.copyto:
  "Cannot cast array data from dtype('int32') to dtype('int16') according to the rule 'safe'"
Verified vs NumPy: int32->int64 safe OK; int32->int16 safe raises, same_kind OK;
int32->float32 safe raises, same_kind OK; float64->int32 safe/same_kind raise, unsafe OK.

Documented finding (NOT changed — no NumPy analog): float->Decimal for NaN/±Inf/overflow
(|v| >= ~7.9e28) silently yields 0. System.Decimal has no NaN/Inf and a smaller range,
so there is no NumPy behavior to match; left as-is pending a decision (0 / throw / saturate).

Tests: Casting/AstypeCastingParamTests.cs (8) — default-unsafe, safe/same_kind/unsafe
outcomes, message shape, Type overload, same-dtype no-op, and value-invariance for allowed
casts. Full suite 11314 passed / 0 failed (net10.0); FuzzMatrix astype corpus green; net8.0 green.
…g UnaryOp.Round

Implements np.rint, the previously-missing float-tier rounding ufunc surfaced during
the int8/float16 dtype sweep. NumPy 2.4.2 is the oracle (probed + generate_umath.py:
rint has only e/f/d/g/F/D/G loops — no integer/bool loops).

rint vs np.round/around: SAME value kernel (round-half-to-even == Math.Round default
== the existing UnaryOp.Round, already internally named "rint" in UfuncName), but a
DIFFERENT dtype rule. around preserves integer dtype (int->int identity); rint is
float-tier like sqrt/sin: bool/int8/uint8 -> float16, int16/uint16/char -> float32,
int32/uint32/int64/uint64 -> float64, floats/complex/decimal preserved. So it maps to
ResolveUnaryFloatReturnType, not the dtype-preserving path.
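
The dtype rule and the shared value rule can be written down as a small Python sketch. The table restates the tiers listed above; `rint_result_dtype` is a hypothetical name, and Python's built-in `round()` is used only because it is also round-half-to-even.

```python
# Float-tier promotion for rint, as listed above; anything else is preserved.
RINT_RESULT_DTYPE = {
    "bool": "float16", "int8": "float16", "uint8": "float16",
    "int16": "float32", "uint16": "float32", "char": "float32",
    "int32": "float64", "uint32": "float64",
    "int64": "float64", "uint64": "float64",
}

def rint_result_dtype(dtype):
    return RINT_RESULT_DTYPE.get(dtype, dtype)

print(rint_result_dtype("int8"))      # float16
print(rint_result_dtype("int64"))     # float64
print(rint_result_dtype("float32"))   # float32 (preserved)

# Round-half-to-even: ties go to the even neighbour.
print([round(v) for v in (0.5, 1.5, 2.5, -2.5, 2.6)])  # [0, 2, 2, -2, 3]
```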

Implementation (reuses existing infra, no new kernel/UnaryOp):
- np.rint.cs — NumPy-shaped overload rint(x, out=, where=, dtype=) + positional-dtype
  convenience forms (modeled on np.trunc.cs).
- Default.Rint.cs — one-liner: ExecuteUnaryOp(nd, UnaryOp.Round,
  ResolveUnaryFloatReturnType(nd, typeCode, "rint"), out, where). Mirrors Default.Sin.
- TensorEngine.cs — two abstract Rint overloads next to Truncate.
- Fixed a REAL latent gap: EmitUnaryComplexOperation threw "not supported for Complex"
  for UnaryOp.Round, so np.around(complex) was ALSO broken. Added the complex Round
  case (rounds real & imag separately, half-to-even) — np.rint(complex) and
  np.around(complex) now both work.

Dtype/value/layout/param parity (all verified vs NumPy 2.4.2):
- 13-dtype tier check; half-to-even values (0.5->0, 2.5->2, -2.5->-2, 2.6->3); nan/inf
  preserved; complex real+imag; decimal; int-input identity-as-float.
- Layouts: strided, negative-stride, broadcast, transpose, empty, 0-d scalar.
- out= (same instance + fill), where= (masked slots keep prior out), dtype= (loop runs
  at that dtype; dtype=<int> raises the no-loop error).

Performance (Release, best-of, reusing the SIMD UnaryOp.Round vector path): at scale
NumSharp is competitive-to-faster: @1M float32/float64 ~5-6x NumPy, @10M float32 3.7x,
float64 1.25x, complex 1.77x; float16 ~parity (scalar path, no Vector<Half> in .NET) and
int paths are widening-cast bound. Same profile as np.around (shared kernel). NumPy's
single-instruction _mm256_round path wins only on tiny L2-resident arrays, where NumSharp's
shared unary-dispatch overhead dominates; that overhead amortizes by ~1M elements.

Tests:
- Math/np.rint.Test.cs (16) — tiers, half-to-even, nan/inf, complex, decimal, layouts,
  out/where/dtype, plus Around_Complex_NowSupported (the bonus fix).
- Fuzz: rint added to gen_oracle UNARY_EXTRA_OPS + OpRegistry; regenerated unary_extra
  corpus (+364 rint cases). FuzzMatrix gate replays them bit-exact.

Docs: CLAUDE.md Math-Arithmetic list + ufunc out=/where= list updated.

Full suite 11314 passed / 0 failed (net10.0), FuzzMatrix green, net8.0 rint green.
…= casting, metamorphic

Extends Math/np.rint.Test.cs from 16 to 31 tests, all pinned to NumPy 2.4.2 output,
covering the subtle parity points beyond the fuzz corpus:

- Signed zero preserved (f64+f32): rint(-0.4)/rint(-0.5)/rint(-0.0) -> -0.0 (signbit
  True), rint(0.4) -> +0.0; tiny-magnitude rint(-1e-300) -> -0.0. Verified NumSharp
  Math.Round matches NumPy signbit exactly.
- Half (float16) half-to-even values; Char (uint16 proxy) -> float32.
- All 15 supported dtypes -> NumPy-parity result dtype (adds Char + Decimal to the tier check).
- Higher-rank 2D values + F-contiguous input.
- out= same_kind narrowing cast (f64 loop -> f32 out) returns the f32 out; out= float->int
  raises (not same_kind); in-place out=a aliasing.
- where= broadcasts (2,) mask across a (2,2) output; masked-off slots keep prior contents.
- Metamorphic (oracle-free): rint idempotent, and odd (rint(-x) == -rint(x)).
- Large integral magnitudes (2^53, +/-1e16) unchanged.
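
The oracle-free properties can be illustrated with a scalar stand-in. This `rint` is a hypothetical pure-Python model (built-in `round()` plus `copysign` to keep the sign of zero), not the NumSharp kernel.

```python
import math

def rint(x):
    """Scalar model: round-half-to-even, preserving the sign of zero."""
    if math.isnan(x) or math.isinf(x):
        return x
    return math.copysign(float(round(x)), x)

samples = [0.4, 0.5, 1.5, 2.5, 2.6, 1e-300, 2.0 ** 53, 1e16]

# Idempotent: applying rint twice changes nothing.
print(all(rint(rint(x)) == rint(x) for x in samples))   # True
# Odd: rint(-x) == -rint(x).
print(all(rint(-x) == -rint(x) for x in samples))       # True
# Signed zero: small negatives round to -0.0.
print(math.copysign(1.0, rint(-0.4)))                   # -1.0
```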

Full suite 11329 passed / 0 failed (net10.0); rint class green on net8.0.
…ype/Stream

Answers "do we support tofile/fromfile to full extent?": tofile was complete, but
fromfile was binary-only with a required dtype, so the pair was asymmetric (tofile could
write text output, or a non-contiguous view, that fromfile could not read back). This
brings fromfile up to NumPy's fromfile(file, dtype=float, count=-1, sep='', offset=0).

Was: fromfile(string file, NPTypeCode|Type dtype) — reads the WHOLE file as binary only.
Now (backward-compatible; the old 2-arg calls still resolve):
  - dtype defaults to float64 (NumPy's default) and may be omitted.
  - count: read N items; -1 (default) reads the rest. Past EOF reads what's present.
  - offset: skip N bytes from the current position (binary only; text+offset raises,
    like NumPy).
  - sep: text mode. Splits NumPy-style — a whitespace-only separator splits on any
    whitespace run; otherwise split on the separator's non-whitespace core with the
    surrounding whitespace treated as a wildcard (matches swab_separator), trailing
    separator ignored. Integer tokens WRAP into the dtype (int8 "300"->44, uint8 "-1"
    ->255) like NumPy's scanf loop; floats accept sci / nan / inf; bool parses as int
    (nonzero => True); malformed data raises ValueError.
  - Stream overload (the file-object form): reads from the current position, left open.
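
The separator and integer-wrap rules above can be sketched in Python. This is a simplified model, not a port of swab_separator or the scanf loop; `split_tokens` and `wrap_int` are hypothetical helpers, and malformed input is not handled here.

```python
def split_tokens(text, sep):
    """Whitespace-only sep splits on whitespace runs; otherwise split on the
    separator's non-whitespace core and ignore surrounding whitespace."""
    core = sep.strip()
    if not core:
        return text.split()
    tokens = [t.strip() for t in text.split(core)]
    if tokens and tokens[-1] == "":
        tokens.pop()                # trailing separator is ignored
    return tokens

def wrap_int(token, bits, signed):
    """Wrap an integer token into a fixed-width dtype (two's complement)."""
    value = int(token) & ((1 << bits) - 1)
    if signed and value >= 1 << (bits - 1):
        value -= 1 << bits
    return value

print(split_tokens("1, 2,3 ,", ","))    # ['1', '2', '3']
print(split_tokens("1  2\n3", " "))     # ['1', '2', '3']
print(wrap_int("300", 8, signed=True))  # 44
print(wrap_int("-1", 8, signed=False))  # 255
```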

NaN parity: NumPy/C text "nan" is the POSITIVE quiet NaN (0x7FF8…); .NET's double.NaN is
the NEGATIVE one (0xFFF8…). fromfile emits the positive pattern so bytes are identical to
NumPy and a narrowing cast lands on float 0x7FC00000 / half 0x7E00.
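
The two NaN bit patterns and the narrowing results can be inspected with `struct`. The sketch assumes the canonical quiet-NaN encodings (sign bit clear vs. set, top mantissa bit set) and an IEEE-754 platform; the helper names are hypothetical.

```python
import math
import struct

POS_QNAN_BITS = 0x7FF8000000000000   # sign bit clear
NEG_QNAN_BITS = 0xFFF8000000000000   # sign bit set

def double_from_bits(bits):
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def bits_of(value, code):
    """Bit pattern of value packed as 'd' (float64), 'f' (float32) or 'e' (float16)."""
    int_code = {"d": "Q", "f": "I", "e": "H"}[code]
    return struct.unpack("<" + int_code, struct.pack("<" + code, value))[0]

pos = double_from_bits(POS_QNAN_BITS)
neg = double_from_bits(NEG_QNAN_BITS)

print(math.isnan(pos), math.isnan(neg))                   # True True
print(math.copysign(1.0, pos), math.copysign(1.0, neg))   # 1.0 -1.0
print(hex(bits_of(pos, "f")), hex(bits_of(pos, "e")))     # 0x7fc00000 0x7e00
```

Both values compare as NaN, so only a byte-level comparison (as the differential fuzz does) can tell them apart.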

Complex: reads the bare "a+bj" form NumPy accepts AND, as a superset, the parenthesized
"(1+2j)" form its own tofile writes — so complex text ROUND-TRIPS in NumSharp (NumPy's
text reader errors on the parenthesized form, so it does not round-trip there).

Binary fast path: a seekable source (every filename) reads straight into one exact-sized
buffer that the array then views (pinned) — a single disk->buffer copy, versus the
MemoryStream growth + ToArray double copy of the streaming fallback (non-seekable streams).

Validation (differential vs NumPy 2.4.2): 35 oracle cases bit-exact (binary count/offset/
EOF-clamp/default-dtype/reinterpret + text sep/whitespace/trailing/count/sci/nan/inf/
int-wrap/bool/complex/empty/malformed-error/offset-error) + Stream + default-dtype + 15
binary round-trips (all dtypes, incl. sliced views) + 6 text round-trips. 15 new MSTest
cases; full CI suite green (11344 passed).

Performance (Release, warm, NPY/NS): binary read 10M int32 = 1.18x (single-copy/disk bound,
both do one copy); text parse 200K float64 = 4.79x.
…the sole NumPy byte export

Per "Breaking Changes OK to match NumPy": remove the pre-parity backward-compat shim so the
NumPy-named tobytes is the one and only byte-export API.

Source
- NdArray.ToByteArray.cs -> NdArray.tobytes.cs: move the implementation into
  tobytes(char order = 'C') and DELETE the public ToByteArray(char) method (tobytes was
  previously just a thin alias delegating to it). Body is byte-for-byte identical, so
  behavior is unchanged; this is a pure rename/removal.
- np.tofile.cs: doc-comment reference ToByteArray('C') -> tobytes('C').

Old-implementation test removed (user-directed)
- Delete test/NumSharp.UnitTest/APIs/np.tofromfile.Test.cs — the 2019-era uint8/uint16
  tofile<->fromfile round-trip. Its coverage is fully subsumed (all 15 dtypes + views) by
  NDArray.tofile.Test.cs (RoundTrip_*_AllDtypes) and NDArray.fromfile.Test.cs
  (RoundTrip_Binary_AllDtypes_IncludingViews).

Tests migrated ToByteArray -> tobytes (coverage preserved, not reduced)
- NDArray.ToByteArray.Test.cs -> NDArray.tobytes.Contract.Test.cs
  (class ToByteArrayTests -> TobytesContractTests): the C6 non-contiguous-view regression
  suite, unchanged in substance.
- NDArray.tobytes.Order.Test.cs: DefaultOrder_IsC_AndAliasMatches -> DefaultOrder_IsC
  (the alias-equality assertions are moot once the alias is gone; the default==C check stays).
- NDArray.ToArray.Test.cs, NDArray.tofile.Test.cs, NDArray.fromfile.Test.cs,
  NumpyByteContractTests.cs: call sites updated to tobytes.

Note: fromfile's Type + NPTypeCode overload pair is KEPT — it is the established NumSharp
idiom (frombuffer carries the identical pair) and NumPy's fromfile(file, dtype, ...) signature
is a superset of the old 2-arg call, so it cannot be "de-compatibilized" without breaking parity.

Validation: net10.0 and net8.0 both Passed 11342 / Failed 0 / Skipped 11
(CI filter TestCategory!=OpenBugs and !=HighMemory and !=LargeMemoryTest).

The XML <remarks> on the `UnmanagedMemoryBlock(T* ptr, long count)`
constructor claimed "Does claim ownership." — the exact opposite of what
the constructor does, and a direct contradiction of its own <summary>
("Construct as a wrapper around pointer ... without claiming ownership").

The body sets `_disposer = Disposer.Null` (AllocationType.Wrap), so
disposing the block never frees `ptr`; the caller owns the memory's
lifetime. A reader trusting the old remark could reason their way into a
double-free (expecting the block to free it) or a use-after-free
(expecting the block NOT to, when it never would). Rewrote the remark to
state the non-owning contract explicitly.

Comment-only change; no behavioral impact.

Regression guards for typed data extraction on a dtype MISMATCH, pinning
NumSharp's contract against the Numpy.NET interop finding "C3": there,
`NDarray.GetData<int>()` on an int64 array silently REINTERPRETS the raw
buffer as int32 and truncates to the element count
([1, 5e9, -1] -> [1, 0, 705032704], i.e. numpy's `a.view(np.int32)[:3]`
byte garbage) — data corruption.

NumSharp must never behave that way. The tests lock in:
  * NDArray.GetData<T>() performs a VALUE CAST equal to numpy's
    `a.astype(...)` — int64->int32 truncates by value ([1, 705032704, -1],
    NOT the reinterpret [1, 0, 705032704]); float64->int32 truncates
    toward zero with overflow wrap ([1, -2, -2147483648]); int32->int64
    widens; matching dtype is a passthrough.
  * NDArray.ToArray<T>() is STRICT and throws ArrayTypeMismatchException
    on mismatch (no silent conversion).

Oracle values produced with NumPy 2.4.2. Category: normal (green) — these
document the intended, correct behavior, not an OpenBug.
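
The two behaviours being distinguished can be reproduced with `struct` on the same three values. This is a sketch of the arithmetic only; `to_int32` is a hypothetical helper modelling the wrapping value cast.

```python
import struct

values = [1, 5_000_000_000, -1]
raw = struct.pack("<3q", *values)          # the int64 buffer, little-endian

# Reinterpret: read the first 3 int32 words of the raw int64 bytes.
reinterpreted = list(struct.unpack("<3i", raw[:12]))
print(reinterpreted)                       # [1, 0, 705032704]

def to_int32(v):
    """Value cast with two's-complement wrap, element by element."""
    v &= 0xFFFFFFFF
    return v - (1 << 32) if v >= 1 << 31 else v

value_cast = [to_int32(v) for v in values]
print(value_cast)                          # [1, 705032704, -1]
```

The reinterpret result mixes the low and high halves of neighbouring elements, while the value cast keeps one output per input element.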
…ark coach

DocFX site updates:

* api/index.md — "Supported Data Types" was stale at 12 numeric types.
  Corrected to the 15 public array dtypes: added the three that were
  missing (SByte -> np.int8, Half -> np.float16,
  Complex -> np.complex128) and a note that NPTypeCode.Empty/String and
  the Float alias are enum/compat values (Float resolves to Single), not
  additional array dtypes.

* api/overwrites/NumSharp.md — new DocFX overwrite supplying a
  summary/remarks block for the `NumSharp` namespace landing page
  (NDArray, the np facade, Shape/Slice/NPTypeCode, NumPyRandom) plus a
  short broadcasting/slicing example. Wired in via docfx.json:
  registered under `overwrite`, and excluded from the `content` glob so
  it is applied as metadata rather than rendered as a standalone page.

* toc.yml — added top-level "Benchmarks" (the dashboard page) and
  "Source Code" (GitHub) entries.

* docs/benchmarks-dashboard.md — added the "Click Here to see breakdown"
  onboarding coach: a one-time, localStorage-remembered bubble that
  points first-time viewers at the interactive breakdown targets
  (2x-5x status band, Reduction suite row, dtype-heatmap tail cell). It
  is IntersectionObserver-driven, keyboard-dismissable, degrades
  gracefully when storage is disabled, and self-dismisses after ~5.6s.
  Also removed the now-defunct `.guide-formula` box and folded its
  content into the "Reading Ratios" prose (ratio = NumPy / NumSharp,
  higher is better), matching the repo-wide NPY/NS convention.

* images/benchmark-dashboard.png — dashboard screenshot asset (also used
  by the README).