Improve parallelism by solving most difficult deals first#216
Improve parallelism by solving most difficult deals first#216tameware wants to merge 11 commits into
Conversation
tameware
commented
Jun 29, 2026
The heuristic extraction refactor changed weight_alloc_trump_void1's first branch from `lead_suit == trump` to `suit == trump`. Since that is exhaustive with the following `else if (suit != trump)`, the three ruffing branches (using the `24 - rank + ...` formula) became dead code, and trump ruffs were scored with side-suit discard weights instead. This mis-ordered ruffs, costing alpha-beta cutoffs. The effect is small for solve but compounds heavily in calc's warm-TT iterative deepening: calc explored ~34% more nodes than v2.9. Restoring the original `lead_suit == trump` pitch branch makes the ruffing branches reachable again and cuts calc time ~25% (gap to v2.9: 1.37x -> 1.02x). Ordering-only change; double-dummy results are unchanged. Co-authored-by: Cursor <cursoragent@cursor.com>
Per Copilot. Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
The heuristic/quick-tricks refactor introduced static_cast<unsigned char>
wrappers on values that v2.9 used as signed, changing search behavior:
- make_3 / make_3_ctx: winner[]/second_best[] .hand and .rank were cast
to unsigned char, turning the -1 "no card" sentinel into 255. This broke
winner[trump].hand == -1 style checks in QuickTricks, losing cutoffs.
- weight_alloc_trump_void2 / _void3: rel_rank[aggr[suit]][...] indexed
through static_cast<unsigned char>(aggr[suit]), truncating the 13-bit
aggregate holding to 8 bits and reading the wrong rel_rank row.
- QuickTricksPartnerHand{Trump,NT}: bit_map_rank index cast the signed
rank through unsigned char.
With these reverted to v2.9's signed handling, the per-move-generation
ordering trace now matches v2.9 exactly (0 divergences on list1), closing
the residual calc gap to parity. Ordering/pruning-only change; double-dummy
results are unchanged and all library tests pass.
Co-authored-by: Cursor <cursoragent@cursor.com>
The parallel board loop handed boards out in index order via an atomic counter, so a hard board picked near the end left one worker running long while the others sat idle. Hand out the hardest boards first (longest- processing-time-first) so the tail consists of cheap boards. parallel_all_boards_n gains an optional dispatch-order permutation: workers still pull from the same atomic counter, but the slot is mapped through the order before becoming a board number, so only the dispatch sequence changes and result placement is unaffected. The solve path passes no order and is unchanged. calc estimates per-deal difficulty with a cheap, trump-independent structural proxy (deal_fanout, mirroring Scheduler::Fanout) and sorts board indices by descending difficulty before dispatch. calc list1000 -n18: ~11.0s -> ~9.6s wall (~13%), user CPU unchanged. Co-authored-by: Cursor <cursoragent@cursor.com>
CalcDDtableN builds one board per strain for a single deal. deal_fanout is trump-independent, so all boards share one fanout and the difficulty sort is a pure no-op there. Gate the sort behind a difficulty_sort flag (default on for batch CalcAllTablesN) and disable it for the single-deal path. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
This PR aims to improve overall throughput for batch calculations by reducing “tail latency” in parallel workloads: it estimates deal difficulty cheaply, sorts boards hardest-first, and adds an optional dispatch-order mechanism to the parallel board runner.
Changes:
- Extend
parallel_all_boards_n()to optionally dispatch boards in a caller-provided order. - Add a cheap per-deal “fanout” estimate and use it to stable-sort batch
calcboards hardest-first before parallel execution. - Simplify/remove several legacy
static_cast<unsigned char>(...)conversions in solver/heuristic code paths.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| library/src/system/parallel_boards.hpp | Adds optional order parameter to control dispatch order (hardest-first, etc.). |
| library/src/system/parallel_boards.cpp | Implements ordered dispatch via slot→board mapping. |
| library/src/calc_tables.cpp | Computes per-deal difficulty estimate and dispatches hardest boards first for batch calc. |
| library/src/heuristic_sorting/heuristic_sorting.cpp | Cleans up heuristic code (including rel-rank indexing casts) and adjusts some void/trump logic. |
| library/src/quick_tricks.cpp | Removes redundant casts when indexing with abs_rank[..].rank. |
| library/src/ab_search.cpp | Removes redundant casts when copying abs_rank winner/second-best into Pos. |
| // Map a dispatch slot to the board number to process. With an order, hand out | ||
| // boards in that sequence (e.g. hardest first); otherwise in index order. | ||
| const bool use_order = | ||
| (order != nullptr && static_cast<int>(order->size()) == count); | ||
| auto board_of = [&](const int slot) -> int { | ||
| return use_order ? (*order)[static_cast<unsigned>(slot)] : slot; | ||
| }; |
Only honor the optional dispatch order when it is a valid permutation of [0, count: each element in range and unique. A malformed order (duplicates or out-of-range values) now falls back to index order, preventing invalid board indices from reaching process_board. EOF ) Co-authored-by: Cursor <cursoragent@cursor.com>
| // Map a dispatch slot to the board number to process. With an order, hand out | ||
| // boards in that sequence (e.g. hardest first); otherwise in index order. | ||
| const bool use_order = | ||
| (order != nullptr && static_cast<int>(order->size()) == count); | ||
| auto board_of = [&](const int slot) -> int { | ||
| return use_order ? (*order)[static_cast<unsigned>(slot)] : slot; | ||
| }; |
| static auto is_permutation_of_range( | ||
| const std::vector<int>& order, | ||
| const int count) -> bool | ||
| { |
There was a problem hiding this comment.
We have std::set_intersection and std::set_difference available in . You can use std::ranges::views::iota to generate a range from [0..count].
Standard library functions are at least expected to be faster and reduce the amount of code we need to write.
There was a problem hiding this comment.
Or not…
Copilot reports that the change is less efficient:
is_permutation_of_range() currently re-sorts the order vector and builds an intermediate common vector via set_intersection. This is O(n log n) plus extra allocations and will run on every parallel_all_boards_n() invocation that supplies an order (e.g. batch calc), adding overhead in a performance-sensitive path. You can validate a permutation of [0, count) in O(n) without sorting by tracking seen indices and checking bounds/duplicates.
Thoughts, @zzcgumn?
There was a problem hiding this comment.
I've had another look. I agree with Copilot - I prefer the original implementation:
static auto is_permutation_of_range(
const std::vector<int>& order,
const int count) -> bool
{
std::vector<char> seen(static_cast<unsigned>(count), 0);
for (const int v : order)
{
if (v < 0 || v >= count || seen[static_cast<unsigned>(v)])
return false;
seen[static_cast<unsigned>(v)] = 1;
}
return true;
}
seen could equally well be std::vector<unsigned char>
Use std::ranges::set_intersection and std::views::iota instead of a hand-rolled seen bitmap in is_permutation_of_range. Co-authored-by: Cursor <cursoragent@cursor.com>
| static auto is_permutation_of_range( | ||
| const std::vector<int>& order, | ||
| const int count) -> bool | ||
| { | ||
| std::vector<int> sorted(order); | ||
| std::ranges::sort(sorted); | ||
|
|
||
| const auto expected = std::views::iota(0, count); | ||
|
|
||
| std::vector<int> common; | ||
| common.reserve(static_cast<unsigned>(count)); | ||
| std::ranges::set_intersection( | ||
| sorted, expected, std::back_inserter(common)); | ||
|
|
||
| return static_cast<int>(common.size()) == count; | ||
| } |
Emscripten's libc++ lacks C++20 ranges; use iterator-based sort, iota, and set_intersection instead. Co-authored-by: Cursor <cursoragent@cursor.com>
| * @param order Optional dispatch order: a permutation of [0, count) giving the | ||
| * sequence in which board numbers are handed out (e.g. hardest first to | ||
| * shorten the tail). When null/empty, boards are dispatched in index | ||
| * order. Only the dispatch order changes; @p process_board still receives | ||
| * the real board number, so result placement is unaffected. | ||
| * @return First non-success code from @p process_board, or RETURN_NO_FAULT. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Scheduler: :Fanout is private; move the fanout logic to a shared deal_fanout() helper so calc_tables can use it without accessing the scheduler instance. Co-authored-by: Cursor <cursoragent@cursor.com>
| static auto is_permutation_of_range( | ||
| const std::vector<int>& order, | ||
| const int count) -> bool | ||
| { | ||
| std::vector<int> sorted(order); | ||
| std::sort(sorted.begin(), sorted.end()); | ||
|
|
||
| std::vector<int> expected(static_cast<unsigned>(count)); | ||
| std::iota(expected.begin(), expected.end(), 0); | ||
|
|
||
| std::vector<int> common; | ||
| common.reserve(static_cast<unsigned>(count)); | ||
| std::set_intersection( | ||
| sorted.begin(), sorted.end(), | ||
| expected.begin(), expected.end(), | ||
| std::back_inserter(common)); | ||
|
|
||
| return static_cast<int>(common.size()) == count; | ||
| } |