GH-49913: [Archery] Add preserve-dir and improve directory layout#50056
GH-49913: [Archery] Add preserve-dir and improve directory layout#50056AntoinePrv wants to merge 9 commits into
Conversation
|
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR enhances archery benchmark workflows by allowing a user-specified preserved workspace location and by emitting benchmark result artifacts (results + run metadata) to disk, alongside a broad CLI formatting/refactor pass.
Changes:
- Add
--preserve-dirto benchmarks and extendtmpdir()to support preserving in a specified location. - Create per-run result directories (C++ runner) and write
benchmark.json+metadata.jsonduringbenchmark run. - Refactor Click option declarations and normalize quoting/formatting across CLI.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| dev/archery/archery/utils/tmpdir.py | Extends tempdir context manager to optionally reuse/create a preserved directory path. |
| dev/archery/archery/cli.py | Adds --preserve-dir, writes benchmark results/metadata files, and reformats many Click options. |
| dev/archery/archery/benchmark/runner.py | Adds stable-ish dir naming for revs, per-run IDs/results dirs (C++), and reuses existing clones. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Note you can simply pass an existing build dir where you have already run the CMake configuration phase: It will even rebuild automatically. I find it simpler and easier to understand than asking Archery to preserve its own build dir. |
|
@pitrou the idea here was to bring an improvement on top of manually specified folder organization. |
| type=click.Path(file_okay=False, resolve_path=True), | ||
| default=None, | ||
| help="Parent directory in which to create the preserved " | ||
| "workspace. Has no effect without --preserve."), |
There was a problem hiding this comment.
Well, perhaps --preserve-dir should automatically imply --preserve? One less option to type.
There was a problem hiding this comment.
I thought I'd stay explicit. That way one can alias with some --preserve-dir and then toogle --preserve on demand. Too much?
| if isinstance(sha, bytes): | ||
| sha = sha.decode("ascii") | ||
| return sha | ||
| except Exception: |
There was a problem hiding this comment.
This is too broad, which kind of error are we trying to intercept here?
There was a problem hiding this comment.
Yhea, catching subprocess.CalledProcessError but really the whole command wrappers should properly isolate that.
| return CppBenchmarkRunner(cmake_def.build(build_dir), **kwargs) | ||
| run_root = os.path.join(root_rev, "build", "run") | ||
| os.makedirs(run_root, exist_ok=True) | ||
| run_id = f"{random.randrange(16**8):08x}" |
There was a problem hiding this comment.
Hmm, so each invocation with --preserve-dir will create an additional CMake build directory? Won't that blow up disk footprint for the preserve_dir ?
There was a problem hiding this comment.
Yes, reusing it seemed to far fetched for this PR. Many things could have changed: CMake options, env vars, dependencies...
I guess we could delete it afterwards, maybe keeping / reading CMakeCache.txt which contains precious information.
On my computer a build dir is ~250Mb so even ten of them is manageable, but no big opinion until we can find a clean way to reuse them.
|
@pitrou waiting for you input on deleting the build dir. |
| json_out = json.dumps(runner_base, cls=JsonEncoder) | ||
| if runner_base.results_dir is not None: | ||
| results_path = os.path.join(runner_base.results_dir, | ||
| "benchmark.json") | ||
| with open(results_path, "w") as f: |
| type=click.Path(file_okay=False, resolve_path=True), | ||
| default=None, | ||
| help="Parent directory in which to create the preserved " | ||
| "workspace. Has no effect without --preserve."), |
|
@pitrou done |
Rationale for this change
Improve workflow when running many benchmarks.
Namely eagerly store results and avoid user having to manage large amount of file names.
What changes are included in this PR?
With the following code run twice,
The
preserve-diris used and its internal structure is as follows.The commit is uniquerly resolved source is reused between runs.
Separate invocation results in separate builds and benchmark execution.
The idea would be to be able to reuse the build dir but it is tricky to tell when it should be changed (If CLI params are different? If environment var change? If dependencies change?).
The idea is to get started minimally without to much maintenance.
Metadata are also a starting point to avoid forgetting what a (costly!) run was about.
Example from current run:
{ "time": "2026-05-27T13:58:02.792064+00:00", "cmd": { "full": "archery benchmark run --suite-filter=parquet --benchmark-filter=Decode --preserve --output out.json HEAD --preserve --preserve-dir bench/", "params": { "suite_filter": "parquet", "benchmark_filter": "Decode", "preserve": true, "output": "<unopened file 'out.json' w>", "preserve_dir": ".../arrow/bench", "rev_or_path": "HEAD", "cpp_benchmark_extras": [], "cmake_extras": [], "benchmark_extras": [], "build_extras": [], "language": "cpp", "src": ".../arrow", "cpp_package_prefix": null, "cxx_flags": null, "cxx": null, "cc": null, "java_options": null, "java_home": null, "repetitions": -1, "repetition_min_time": null } }, "machine_info": { "platform": "macOS-26.5-arm64-arm-64bit-Mach-O", "system": "Darwin", "release": "25.5.0", "version": "Darwin Kernel Version 25.5.0: Mon Apr 27 20:41:06 PDT 2026; root:xnu-12377.121.6~2/RELEASE_ARM64_T6030", "machine": "arm64", "processor": "arm", "architecture": "64bit", "logical_cores": 12 } }Are these changes tested?
Yes
Are there any user-facing changes?
Archery user that looked into the
--preservedirectory.