Skip to content

Partition PendingContext into operation state and actual pending state#1880

Open
TedHartMS wants to merge 11 commits into
mainfrom
tedhar/ctx-min
Open

Partition PendingContext into operation state and actual pending state#1880
TedHartMS wants to merge 11 commits into
mainfrom
tedhar/ctx-min

Conversation

@TedHartMS

Copy link
Copy Markdown
Contributor

This pull request introduces several improvements to in-place update logic, overflow detection, and memory management in the storage engine. The most significant changes are improved overflow handling in arithmetic operations, more precise management of record size information, and safer transfer of pending operation data. These changes enhance performance and correctness, especially in hot code paths, while also simplifying and clarifying the codebase.

Arithmetic overflow handling and in-place update logic:

  • Replaced checked arithmetic and exception handling in TryInPlaceUpdateNumber with a branchless signed overflow detection technique, eliminating expensive try/catch blocks and improving performance on the hot path. Now, overflows are detected using bitwise operations, and errors are flagged directly without exceptions. [1] [2] [3]

Record size and memory management:

  • Refactored InPlaceUpdaterWorker to declare RecordSizeInfo locally within each relevant switch case, removing the previously shared sizeInfo2 variable and clarifying where size information is needed. Added explicit calls to AssertOptionalsIfSet after size changes to ensure internal consistency. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]

Pending operation and output transfer:

  • Updated CompletedOutputIterator and CompletedOutput to transfer data from the pending IO slot rather than the pending context, ensuring that all relevant fields are moved and cleared properly. This change improves the safety and correctness of asynchronous IO handling. [1] [2]

Scan and IO preparation logic:

  • Refactored ConditionalScanPush and PrepareIOForConditionalScan to pass and use the key hash explicitly, and to set scan-specific payloads on the pending operation slot, improving clarity and correctness in scan operations. [1] [2] [3]

Async IO context cleanup:

  • Changed the async IO context return logic to reset both basePendingContext and slot fields, ensuring no stale references remain in pooled contexts.

@TedHartMS TedHartMS requested a review from badrishc June 16, 2026 06:37
@TedHartMS TedHartMS marked this pull request as ready for review June 19, 2026 02:34
Copilot AI review requested due to automatic review settings June 19, 2026 02:34

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors Tsavorite’s pending-IO plumbing by splitting the former PendingContext into (1) a small, stack-carried hot-path OperationState and (2) a pending-only PendingState stored on the pooled PendingIoContext. It also updates arithmetic in-place update logic, record-size bookkeeping, and related docs/benchmarks to improve correctness/perf on hot code paths.

Changes:

  • Partition pending-operation data into OperationState (hot-path) + PendingState (pending-only) and update the full pending/complete/re-pend flow accordingly.
  • Optimize in-place numeric updates (overflow detection) and tighten record-size/optionals consistency checks during in-place updates.
  • Update Tsavorite docs and Benchmarks (including adding a larger-than-memory benchmark variant + a benchmark configuration hook).

Reviewed changes

Copilot reviewed 33 out of 33 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
website/docs/dev/tsavorite/reviv.md Docs update: retry reuse now refers to OperationState rather than the old pending struct.
website/docs/dev/tsavorite/logrecord.md Docs update for PendingState/pending IO terminology and behavior.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/TsavoriteThread.cs Updates pending completion to use baseOperationState + pendingState and adjusted transfer/disposal semantics.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Tsavorite.cs Replaces stack-allocated PendingContext with stack-allocated OperationState across sync read/upsert/rmw/delete entrypoints.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/TryCopyToTail.cs Updates copy-to-tail path to accept/record OperationState fields.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/TryCopyToReadCache.cs Updates read-cache copy path to use OperationState.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalUpsert.cs Refactors Upsert hot path to OperationState and updates retry-reuse plumbing.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalRMW.cs Refactors RMW hot + pending path initialization to populate PendingState via OperationState.pendingOp.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalRead.cs Refactors Read hot + pending path initialization; adds explicit isFromPending plumbing for ReadInfo.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/InternalDelete.cs Refactors Delete hot path to OperationState.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/HandleOperationStatus.cs Centralizes pending op finalization around operationState.pendingOp; assigns IO id on the AsyncIOContext and handles diskLogRecord reuse/cleanup.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/FindRecord.cs Updates helper signatures/comments to use OperationState for pending-aware searches.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/EpochOperations.cs Updates SynchronizeEpoch to accept OperationState.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/ContinuePending.cs Refactors all ContinuePending* methods to operate on OperationState + PendingState; updates re-pend flow.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/ConditionalCopyToTail.cs Updates conditional copy-to-tail + IO prep to store conditional payload on PendingState and hot-path fields on OperationState.
libs/storage/Tsavorite/cs/src/core/Index/Tsavorite/Implementation/BlockAllocate.cs Moves flush-event + retry allocation tracking from old pending struct onto OperationState.
libs/storage/Tsavorite/cs/src/core/Index/Common/PendingState.cs Renames/refactors the former PendingContext to a pending-only PendingState payload (key/input/output/diskLogRecord, etc.).
libs/storage/Tsavorite/cs/src/core/Index/Common/PendingIoContext.cs Pending IO context now carries baseOperationState + pendingState instead of a single pending struct.
libs/storage/Tsavorite/cs/src/core/Index/Common/OperationStatus.cs Updates docs to reference OperationState in retry-handling remarks.
libs/storage/Tsavorite/cs/src/core/Index/Common/OperationState.cs New: defines the hot-path per-operation state that is carried on the caller stack and snapshotted into pending ops.
libs/storage/Tsavorite/cs/src/core/Index/Common/ExecutionContext.cs Pool reset now clears both baseOperationState and pendingState for safety.
libs/storage/Tsavorite/cs/src/core/Index/Common/CompletedOutput.cs Completed-output transfer now moves key/input/output from PendingState and metadata from OperationState.
libs/storage/Tsavorite/cs/src/core/Allocator/LogRecord.cs Doc update: record reuse mentions OperationState.
libs/storage/Tsavorite/cs/src/core/Allocator/ConditionallyHoistedKey.cs Doc update: references OperationState as the containing hot-path struct for sizing rationale.
libs/storage/Tsavorite/cs/src/core/Allocator/AsyncIOContext.cs Doc update: key shallow-copy comment references operationState.
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorScan.cs Refactors conditional scan push to use OperationState + explicit keyHash flow and pendingState scan payload.
libs/storage/Tsavorite/cs/benchmark/KV.benchmark/README.md Updates benchmark docs to refer to OperationState instead of PendingContext.
libs/server/Storage/Functions/MainStore/RMWMethods.cs Refactors record size tracking to declare RecordSizeInfo per-branch and assert optionals consistency where applicable.
libs/server/Storage/Functions/MainStore/PrivateMethods.cs Replaces checked+try/catch overflow handling with branchless signed overflow detection in TryInPlaceUpdateNumber.
benchmark/BDN.benchmark/Operations/RawStringOperations.cs Minor benchmark method-body simplifications.
benchmark/BDN.benchmark/Operations/OperationsBase.cs Adds ConfigureServerOptions hook and validates LocalMemory device config for IO-path benchmarking.
benchmark/BDN.benchmark/Operations/LTM/RawStringOperations.cs New larger-than-memory RawString benchmark variant using LocalMemory device tiering + randomized fixed-width keys.
benchmark/BDN.benchmark/Network/RawStringOperations.cs Minor benchmark method-body simplifications (async).
Comments suppressed due to low confidence (1)

libs/storage/Tsavorite/cs/src/core/Index/Common/PendingState.cs:157

  • Typo in XML doc comment: "pending pendingState" has a duplicated word.

Comment on lines 78 to 82
// now below HeadAddress and there is at least one record below HeadAddress but above InitialLatestLogicalAddress. Reissue the Read(),
// using the LogicalAddress we just found as minAddress. We will either find an in-memory version of the key that was added after the
// TryFindRecordInMemory we just did, or do IO and find the record we just found or one above it. Read() updates InitialLatestLogicalAddress,
// so if we do IO, the next time we come to CompletePendingRead we will only search for a newer version of the key in any records added
// after our just-completed TryFindRecordInMemory.
Comment on lines +115 to 120
if (internalStatus != OperationStatus.RECORD_ON_DISK)
{
pendingState = newOp.pendingState;
newOp.pendingState = default;
sessionFunctions.Ctx.ReturnAsyncIOContext(newOp);
}
Comment on lines +317 to +322
if (status != OperationStatus.RECORD_ON_DISK)
{
pendingState = newOp.pendingState;
newOp.pendingState = default;
sessionFunctions.Ctx.ReturnAsyncIOContext(newOp);
}
SkipReadCache(ref stackCtx, out _); // Where this is called, we have no dependency on source addresses so we don't care if it Refreshed

// We don't have a pendingContext here, so pass the minAddress directly.
// We don't have a operationState here, so pass the minAddress directly.
`PendingState` carries information through the IO process and provides the source record (via its `DiskLogRecord`, which is an `ISourceLogRecord`) for RMW copy updates.

Previously `PendingContext` had separate `HeapContainers` for keys and values. However, for operations such as conditional insert for Copy-To-Tail or Compaction, we need to carry through the entire log record (including optionals). In the case of records read from disk (e.g. Compaction), it is easiest to pass the `LogRecord` in its entirety, including its `SectorAlignedMemory` buffer, in the `DiskLogRecord`. So now PendingContext will also serialize the Key passed to Upsert or RMW, and the value passed to Upsert, as a `DiskLogRecord`. `PendingContext` still carries the `HeapContainer` for Input, and `CompletedOutputs` must still retain the Key's `HeapContainer`.
Previously `PendingState` had separate `HeapContainers` for keys and values. However, for operations such as conditional insert for Copy-To-Tail or Compaction, we need to carry through the entire log record (including optionals). In the case of records read from disk (e.g. Compaction), it is easiest to pass the `LogRecord` in its entirety, including its `SectorAlignedMemory` buffer, in the `DiskLogRecord`. So now PendingState will also serialize the Key passed to Upsert or RMW, and the value passed to Upsert, as a `DiskLogRecord`. `PendingState` still carries the `HeapContainer` for Input, and `CompletedOutputs` must still retain the Key's `HeapContainer`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants