GH-50194: [C++] Move S3 and AWS-SDK to its own libarrow_s3.so #50195

Draft
raulcd wants to merge 4 commits into apache:main from raulcd:GH-50194

Conversation

@raulcd

@raulcd raulcd commented Jun 16, 2026

Member

Warning

Do not merge: the approach taken in this PR is still under discussion.

Rationale for this change

This change aims to reduce the size of libarrow.so and remove the AWS SDK from some builds. It lets users plug in functionality based on their requirements and divides our functionality into cleaner modules.

What changes are included in this PR?

Unconditionally build S3 and the AWS SDK into a different module libarrow_s3.so outside of libarrow.so.
Update bindings to link against the new libarrow_s3.so library.
Update the Linux Package jobs to ship the new module in a separate package.

Are these changes tested?

Yes, via CI.

Are there any user-facing changes?

Yes, users will need to either link against libarrow_s3.so or register the S3 filesystem using LoadFileSystemFactories.

@github-actions github-actions Bot removed the CI: Extra: C++ Run extra C++ CI label Jun 22, 2026
@raulcd raulcd added CI: Extra: R Run extra R CI CI: Extra: Package: Linux Run extra Linux Packages CI labels Jun 22, 2026
@raulcd

raulcd commented Jun 22, 2026

Member Author

@pitrou @kou I've been working on splitting the S3 filesystem (and the AWS SDK) out of libarrow.so into its own library, libarrow_s3.so.
In this PR I am only moving the AWS SDK and the S3 filesystem sources into their own library. Any user who wants S3 must either link against the new library (as we do with the bindings) or dlopen it via LoadFileSystemFactories(path), which registers the filesystem at load time with no link-time dependency.
In this PR I am not planning to move our existing bindings to the FileSystemFromUriAndOptions and LoadFileSystemFactories path. I am also not sure we should. I think that path is good for a user who doesn't want to link against libarrow_s3.so but still wants the same functionality, but it is probably not what the majority of users (and our internal bindings) should do. What are your thoughts on that?
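
To make the load-time registration path concrete, here is a toy Python model of the idea. It is only a sketch and not Arrow's implementation: `_FACTORIES`, `register_factory`, `load_module`, `filesystem_from_uri`, and `s3_module_init` are all hypothetical names, and a plain function stands in for a shared library whose initializer runs when it is dlopen-ed.

```python
from urllib.parse import urlparse

# Hypothetical registry: URI scheme -> factory callable.
_FACTORIES = {}


def register_factory(scheme, factory):
    """Register a filesystem factory; refuse duplicates for the same scheme."""
    if scheme in _FACTORIES:
        raise ValueError(f"factory already registered for scheme {scheme!r}")
    _FACTORIES[scheme] = factory


def load_module(module_init):
    """Stand-in for dlopen: running the initializer registers the module's factories."""
    module_init()


def filesystem_from_uri(uri):
    """Look up the factory by URI scheme; fail clearly if no module registered it."""
    scheme = urlparse(uri).scheme
    try:
        factory = _FACTORIES[scheme]
    except KeyError:
        raise LookupError(f"no filesystem registered for scheme {scheme!r}") from None
    return factory(uri)


def s3_module_init():
    # Hypothetical initializer of the S3 module; returns a tuple as a fake filesystem.
    register_factory("s3", lambda uri: ("S3FileSystem", uri))
```

In this model the core never references the S3 code directly: an `s3://` URI only resolves after `load_module(s3_module_init)` has run, which mirrors the "no link dependency" property described above.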
As for artifact sizes, with the new code libarrow.so and libarrow_s3.so are:

$ ls -lhL libarrow.so libarrow_s3.so
-rwxrwxr-x 1 raulcd raulcd  59M Jun 22 19:30 libarrow_s3.so
-rwxrwxr-x 1 raulcd raulcd 317M Jun 22 19:29 libarrow.so

And we can see that AWS symbols aren't present in libarrow.so:

$ nm -C libarrow.so | grep -c "Aws::"
0
$ nm -C libarrow_s3.so | grep -c "Aws::"
33991
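
For anyone who wants to repeat this check from a script, a minimal Python sketch could look like the following. The helper names are hypothetical, and it assumes binutils' `nm` is on the PATH; the counting mirrors `grep -c`, which counts matching lines.

```python
import subprocess


def count_matching_symbols(nm_output, needle="Aws::"):
    """Count lines of nm output that contain `needle`, like `grep -c`."""
    return sum(1 for line in nm_output.splitlines() if needle in line)


def count_symbols_in_library(path, needle="Aws::"):
    """Run `nm -C <path>` and count matching lines (assumes nm is installed)."""
    result = subprocess.run(
        ["nm", "-C", path], capture_output=True, text=True, check=True
    )
    return count_matching_symbols(result.stdout, needle)
```

A CI step could then assert that `count_symbols_in_library("libarrow.so")` is zero, so that AWS symbols do not creep back into the core library.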

On current main, libarrow.so is larger and contains the AWS SDK symbols:

$ ls -lhL libarrow.so
-rwxrwxr-x 1 raulcd raulcd 368M Jun 22 19:45 libarrow.so
$ ls -lhL libarrow_s3.so
ls: cannot access 'libarrow_s3.so': No such file or directory
$ nm -C libarrow.so | grep -c "Aws::"
33991

These are debug builds, but as a summary:
libarrow.so goes from 368M to 317M (~51M smaller), and the AWS symbols (33,991 of them) move entirely into the new 59M libarrow_s3.so.
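
The arithmetic behind that summary, as a small Python sketch. The inputs are the rounded `ls -lh` figures above, so the results are approximate, and the variable names are just for illustration.

```python
# Sizes in MiB as reported by `ls -lh` above (debug builds, rounded).
MAIN_LIBARROW_MB = 368     # libarrow.so on current main
SPLIT_LIBARROW_MB = 317    # libarrow.so with this PR
SPLIT_LIBARROW_S3_MB = 59  # libarrow_s3.so with this PR

saved_mb = MAIN_LIBARROW_MB - SPLIT_LIBARROW_MB
saved_pct = 100 * saved_mb / MAIN_LIBARROW_MB
combined_mb = SPLIT_LIBARROW_MB + SPLIT_LIBARROW_S3_MB
overhead_mb = combined_mb - MAIN_LIBARROW_MB

print(f"libarrow.so shrinks by {saved_mb}M ({saved_pct:.1f}%)")
print(f"both libraries together: {combined_mb}M ({overhead_mb:+d}M vs main)")
```

So a user who installs both libraries ends up with slightly more on disk than today, while a user who skips S3 saves roughly 51M in these debug builds.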

@kou

kou commented Jun 23, 2026

Member

I think that we should use LoadFileSystemFactories() for bindings to avoid loading the S3 module for users who don't need S3. For example, PyArrow users who also want to use the S3 module would install pyarrow_s3 (or something) explicitly.

I think that bindings can provide a convenient API to use the S3 module even if we use LoadFileSystemFactories().

@raulcd

raulcd commented Jun 23, 2026

Member Author

> I think that we should use LoadFileSystemFactories() for bindings to avoid loading the S3 module for users who don't need S3. For example, PyArrow users who also want to use the S3 module would install pyarrow_s3 (or something) explicitly.

With conda this isn't necessary: we already ship each .so as a separate package, allowing users to pick and choose what to install. This won't change. If a user installs pyarrow-core and libarrow-s3, they will have S3 capabilities, the same as installing pyarrow-core with libarrow-flight today. If libarrow-s3 is not installed, PyArrow would just raise ImportError when it cannot find the corresponding DLL. Basically, we build with all capabilities turned on but install only the necessary .so files, and raise ImportError for any that are not found. I'll validate that pyarrow fails with ImportError if the .so isn't present, but this should behave like the other modules.
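
From the user's side, that ImportError behaviour can be handled with a guard like the one below. This is a minimal sketch: `s3_available` is a hypothetical helper name, and it assumes the S3 class keeps being exposed as `pyarrow.fs.S3FileSystem`.

```python
def s3_available():
    """Return True if PyArrow's S3 filesystem can be imported in this environment."""
    try:
        from pyarrow.fs import S3FileSystem  # noqa: F401
    except ImportError:
        # Covers both a missing pyarrow and a pyarrow without S3 support.
        return False
    return True
```

Code that treats S3 as optional can branch on `s3_available()` instead of failing at import time.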

Wheels are a different beast, and I have to explore them a little further. A related issue:

The original problem we had with wheels is that there's no mechanism to share dependencies between wheels. auditwheel/delvewheel/delocate mangle the .so names so that wheels don't clash with each other's dependency symbols. The problem is that libarrow_s3.so requires libarrow symbols, and it's not clear how this pyarrow_s3 would be shipped. Should it include its own libarrow, using the mangled symbols for the new wheel? Should it use the libarrow library from the main pyarrow wheel? What happens when different versions of pyarrow and pyarrow_s3 are installed?

As a note, I've just validated that we don't mangle libarrow (or any of our .so files) in the wheels. I am going to explore this a little further to see if I can come up with something, even though I am still unclear about some of the questions above, such as version matching to avoid ABI problems.

Related: @amol- who worked on consolidatewheels in the past:

There are also Python PEP efforts under discussion to define external dependencies for wheels:
https://discuss.python.org/t/pep-725-specifying-external-dependencies-in-pyproject-toml-round-2/103890

What I am saying is that using LoadFileSystemFactories() doesn't solve the real problem, which in my opinion is: how do we share a single libarrow between several extra wheels and coordinate versioning?
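
To illustrate only the version-coordination half of that question, an add-on wheel could refuse to load against a mismatched core, roughly as sketched below. This is an assumption-laden sketch: `pyarrow_s3` is the hypothetical package name from the discussion above, both helper names are made up, and exact version equality is just one possible policy. It does not address how a single libarrow would be shared.

```python
from importlib import metadata


def require_same_version(core_version, extra_version, extra_name="pyarrow_s3"):
    """Raise ImportError unless the add-on matches the core version exactly."""
    if core_version != extra_version:
        raise ImportError(
            f"{extra_name} {extra_version} does not match pyarrow {core_version}; "
            "install matching versions to avoid ABI mismatches"
        )


def check_installed(extra_name="pyarrow_s3"):
    """Compare installed distribution versions (pyarrow_s3 is hypothetical)."""
    require_same_version(
        metadata.version("pyarrow"), metadata.version(extra_name), extra_name
    )
```

The same constraint could also be expressed in packaging metadata, with the add-on pinning an exact pyarrow version, so that the installer rejects a mismatch before anything is imported.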

cc @h-vetinari who knows this space and might shed some light

@raulcd

raulcd commented Jun 23, 2026

Member Author

cc @jorisvandenbossche

@h-vetinari

Contributor

For the PyPI side, you might be able to do something similar to what numpy/scipy are doing with openblas as a wheel.

target_link_libraries(arrow_s3fs PRIVATE ${AWSSDK_LINK_LIBRARIES} arrow_shared)
set_source_files_properties(filesystem/s3fs.cc filesystem/s3fs_module.cc
PROPERTIES SKIP_UNITY_BUILD_INCLUSION ON)
if(ARROW_BUILD_STATIC AND WIN32)

Member

The AND WIN32 isn't useful, right?

Member Author

We use the same pattern in other places:

  if(ARROW_BUILD_STATIC AND WIN32)
    target_compile_definitions(arrow_compute_static PUBLIC ARROW_COMPUTE_STATIC)
  endif()

or

if(ARROW_BUILD_STATIC AND WIN32)
  target_compile_definitions(arrow_static PUBLIC ARROW_STATIC)
endif()

Looking at visibility.h, the use of ARROW_S3_STATIC is already guarded for Windows:

#if defined(_WIN32) || defined(__CYGWIN__)

So the definition only has an effect on Windows and is not needed elsewhere. I would say the AND WIN32 does nothing functionally, but it's good hygiene?

Comment on lines +983 to +984
string(APPEND ARROW_S3_PC_CFLAGS "${ARROW_S3_PC_CFLAGS_PRIVATE}")
set(ARROW_S3_PC_CFLAGS_PRIVATE "")

Member

What is the purpose of this? It doesn't seem used below?

@raulcd raulcd Jun 23, 2026

Member Author

From what I remember, this is used to populate the following:

Cflags:@ARROW_S3_PC_CFLAGS@
Cflags.private:@ARROW_S3_PC_CFLAGS_PRIVATE@

in arrow/cpp/src/arrow/arrow-s3.pc.in. This uses the same mechanism we introduced for other libraries here:
3351aeb

It has something to do with pkg-config and static builds. @kou might shed some light on why this was necessary; I hardly remember, but I can look into it again.

Member

In general, Cflags.private is used only for static builds (pkgconf --static ...). If we build only the static library, pkgconf ... (without --static) doesn't work, which is inconvenient.

This makes pkgconf ... (without --static) work with a static-library-only build. (This is the R package case.)

@pitrou

pitrou commented Jun 23, 2026

Member

So, this is as if ARROW_S3_MODULE was always enabled, right?

@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 23, 2026
@raulcd

raulcd commented Jun 23, 2026

Member Author

> So, this is as if ARROW_S3_MODULE was always enabled, right?

Yes, but with a small caveat. With ARROW_S3_MODULE, S3 was bundled into libarrow and a separate .so was also generated. This PR removes ARROW_S3_MODULE and makes ARROW_S3 unconditionally generate a new libarrow_s3.so, removing the AWS SDK and s3fs.cc from libarrow.so. The bindings link against this new library. I have also validated the size reduction of libarrow.so and that it contains no AWS symbols.

@pitrou

pitrou commented Jun 23, 2026

Member

Oh, great, thank you!
