Skip to content

GH-49913: [Archery] Add preserve-dir and improve directory layout#50056

Open
AntoinePrv wants to merge 9 commits into
apache:mainfrom
AntoinePrv:archery
Open

GH-49913: [Archery] Add preserve-dir and improve directory layout#50056
AntoinePrv wants to merge 9 commits into
apache:mainfrom
AntoinePrv:archery

Conversation

@AntoinePrv

@AntoinePrv AntoinePrv commented May 27, 2026

Copy link
Copy Markdown
Collaborator

Rationale for this change

Improve workflow when running many benchmarks.
Namely eagerly store results and avoid user having to manage large amount of file names.

What changes are included in this PR?

With the following code run twice,

archery benchmark run --suite-filter=parquet --benchmark-filter=Decode --preserve --output out.json HEAD --preserve --preserve-dir  bench/

The preserve-dir is used and its internal structure is as follows.

bench
└── 0202ab5295fee14c6c7b85b05ca78cc5c281b3a5
    ├── arrow
    │   └── // Source files...
    ├── bench
    │   ├── 84e0ab99
    │   │   ├── benchmark.json
    │   │   └── metadata.json
    │   └── e15e438a
    │       ├── benchmark.json
    │       └── metadata.json
    └── build
        ├── 84e0ab99
        │   └── // Cmake build files...
        └── e15e438a
            └── // Cmake build files...

The commit is uniquerly resolved source is reused between runs.
Separate invocation results in separate builds and benchmark execution.
The idea would be to be able to reuse the build dir but it is tricky to tell when it should be changed (If CLI params are different? If environment var change? If dependencies change?).

The idea is to get started minimally without to much maintenance.

Metadata are also a starting point to avoid forgetting what a (costly!) run was about.
Example from current run:

{
  "time": "2026-05-27T13:58:02.792064+00:00",
  "cmd": {
    "full": "archery benchmark run --suite-filter=parquet --benchmark-filter=Decode --preserve --output out.json HEAD --preserve --preserve-dir bench/",
    "params": {
      "suite_filter": "parquet",
      "benchmark_filter": "Decode",
      "preserve": true,
      "output": "<unopened file 'out.json' w>",
      "preserve_dir": ".../arrow/bench",
      "rev_or_path": "HEAD",
      "cpp_benchmark_extras": [],
      "cmake_extras": [],
      "benchmark_extras": [],
      "build_extras": [],
      "language": "cpp",
      "src": ".../arrow",
      "cpp_package_prefix": null,
      "cxx_flags": null,
      "cxx": null,
      "cc": null,
      "java_options": null,
      "java_home": null,
      "repetitions": -1,
      "repetition_min_time": null
    }
  },
  "machine_info": {
    "platform": "macOS-26.5-arm64-arm-64bit-Mach-O",
    "system": "Darwin",
    "release": "25.5.0",
    "version": "Darwin Kernel Version 25.5.0: Mon Apr 27 20:41:06 PDT 2026; root:xnu-12377.121.6~2/RELEASE_ARM64_T6030",
    "machine": "arm64",
    "processor": "arm",
    "architecture": "64bit",
    "logical_cores": 12
  }
}

Are these changes tested?

Yes

Are there any user-facing changes?

Archery user that looked into the --preserve directory.

Copilot AI review requested due to automatic review settings May 27, 2026 14:57
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #49913 has been automatically assigned in GitHub to PR creator.

@github-actions github-actions Bot added the awaiting review Awaiting review label May 27, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR enhances archery benchmark workflows by allowing a user-specified preserved workspace location and by emitting benchmark result artifacts (results + run metadata) to disk, alongside a broad CLI formatting/refactor pass.

Changes:

  • Add --preserve-dir to benchmarks and extend tmpdir() to support preserving in a specified location.
  • Create per-run result directories (C++ runner) and write benchmark.json + metadata.json during benchmark run.
  • Refactor Click option declarations and normalize quoting/formatting across CLI.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
dev/archery/archery/utils/tmpdir.py Extends tempdir context manager to optionally reuse/create a preserved directory path.
dev/archery/archery/cli.py Adds --preserve-dir, writes benchmark results/metadata files, and reformats many Click options.
dev/archery/archery/benchmark/runner.py Adds stable-ish dir naming for revs, per-run IDs/results dirs (C++), and reuses existing clones.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread dev/archery/archery/utils/tmpdir.py
Comment thread dev/archery/archery/benchmark/runner.py
Comment thread dev/archery/archery/cli.py
Comment thread dev/archery/archery/cli.py
Comment thread dev/archery/archery/cli.py
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels May 27, 2026
@pitrou

pitrou commented Jun 11, 2026

Copy link
Copy Markdown
Member

Note you can simply pass an existing build dir where you have already run the CMake configuration phase:

$ cmake --preset ninja-benchmarks -S . -B /build/bench-build
$ archery benchmark run /build/bench-build --suite-filter=parquet --benchmark-filter=Decode --output out.json

It will even rebuild automatically. I find it simpler and easier to understand than asking Archery to preserve its own build dir.

@AntoinePrv

AntoinePrv commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

@pitrou the idea here was to bring an improvement on top of manually specified folder organization.
If you need to benchmark at different commits, or re-run benchmarks multiple times during a feature development, it becomes burdensome and error-prone to have to manage folders like bench-build and output.json: checking out a revision, naming things, typing them, remembering when the build folder was last build etc. (see issue for full argument).

Comment thread dev/archery/archery/cli.py Outdated
type=click.Path(file_okay=False, resolve_path=True),
default=None,
help="Parent directory in which to create the preserved "
"workspace. Has no effect without --preserve."),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, perhaps --preserve-dir should automatically imply --preserve? One less option to type.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought I'd stay explicit. That way one can alias with some --preserve-dir and then toogle --preserve on demand. Too much?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much IMHO indeed :)

Comment thread dev/archery/archery/cli.py
Comment thread dev/archery/archery/cli.py
Comment thread dev/archery/archery/benchmark/runner.py Outdated
if isinstance(sha, bytes):
sha = sha.decode("ascii")
return sha
except Exception:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is too broad, which kind of error are we trying to intercept here?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yhea, catching subprocess.CalledProcessError but really the whole command wrappers should properly isolate that.

Comment thread dev/archery/archery/benchmark/runner.py
Comment thread dev/archery/archery/benchmark/runner.py
Comment thread dev/archery/archery/benchmark/runner.py Outdated
return CppBenchmarkRunner(cmake_def.build(build_dir), **kwargs)
run_root = os.path.join(root_rev, "build", "run")
os.makedirs(run_root, exist_ok=True)
run_id = f"{random.randrange(16**8):08x}"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, so each invocation with --preserve-dir will create an additional CMake build directory? Won't that blow up disk footprint for the preserve_dir ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, reusing it seemed to far fetched for this PR. Many things could have changed: CMake options, env vars, dependencies...

I guess we could delete it afterwards, maybe keeping / reading CMakeCache.txt which contains precious information.

On my computer a build dir is ~250Mb so even ten of them is manageable, but no big opinion until we can find a clean way to reuse them.

Comment thread dev/archery/archery/benchmark/runner.py Outdated
Copilot AI review requested due to automatic review settings June 23, 2026 13:18
@AntoinePrv

Copy link
Copy Markdown
Collaborator Author

@pitrou waiting for you input on deleting the build dir.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comment thread dev/archery/archery/benchmark/runner.py
Comment thread dev/archery/archery/benchmark/runner.py
Comment thread dev/archery/archery/benchmark/runner.py
Comment thread dev/archery/archery/cli.py
Comment on lines 496 to +500
json_out = json.dumps(runner_base, cls=JsonEncoder)
if runner_base.results_dir is not None:
results_path = os.path.join(runner_base.results_dir,
"benchmark.json")
with open(results_path, "w") as f:
Comment thread dev/archery/archery/cli.py Outdated
type=click.Path(file_okay=False, resolve_path=True),
default=None,
help="Parent directory in which to create the preserved "
"workspace. Has no effect without --preserve."),

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too much IMHO indeed :)

Comment thread dev/archery/archery/cli.py
Comment thread dev/archery/archery/benchmark/runner.py Outdated
@AntoinePrv

Copy link
Copy Markdown
Collaborator Author

@pitrou done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

awaiting committer review Awaiting committer review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants