Skip to content

feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760

Open
goel-skd wants to merge 2 commits into
apache:mainfrom
goel-skd:feat-613-unicode-lowercase
Open

feat(string_util): make ToLower Unicode-aware via utf8proc (2/2)#760
goel-skd wants to merge 2 commits into
apache:mainfrom
goel-skd:feat-613-unicode-lowercase

Conversation

@goel-skd

@goel-skd goel-skd commented Jun 19, 2026

Copy link
Copy Markdown
Contributor

Replaces the ASCII-only StringUtils::ToLower with a Unicode-aware
implementation backed by utf8proc,
so case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT).

  • ToLower now lower-cases UTF-8 input using utf8proc simple (1:1) case
    mapping (e.g. CAFÉcafé, GROẞEgroße). Invalid UTF-8 is
    returned unchanged rather than erroring.

  • EqualsIgnoreCase now compares the lowercased forms of both inputs, so it
    is case-insensitive for non-ASCII letters too.

  • ToUpper is intentionally left ASCII-only — it is not used for name
    matching.

  • utf8proc is wired into both the CMake (vendored via FetchContent / system
    package) and Meson (subprojects/utf8proc.wrap) builds.

    Testing

  • Added/updated string_util_test.cc: ToLowerUnicode, ToUpperAsciiOnly,
    and Unicode cases in EqualsIgnoreCase (including invalid-UTF-8
    pass-through).

Closes #613.

Follow-up to #748

@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from bec8884 to 69cc006 Compare June 19, 2026 01:40
Replace the ASCII-only ToLower with utf8proc simple case mapping so
case-insensitive name handling matches Iceberg Java's
toLowerCase(Locale.ROOT). ToUpper stays ASCII-only since it is not used
for name matching. EqualsIgnoreCase now compares lowercased forms.

Wire utf8proc into both the CMake (vendored/system) and Meson builds.

See apache#613.
@goel-skd goel-skd force-pushed the feat-613-unicode-lowercase branch from 69cc006 to f42e2da Compare June 19, 2026 02:13
Comment thread cmake_modules/IcebergThirdpartyToolchain.cmake

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

/// \brief Upper-case ASCII letters; non-ASCII (multibyte UTF-8) bytes pass through
/// unchanged.
///
/// Unlike ToLower this is ASCII-only, since upper-casing is not used for name matching.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unlike ToLower this is ASCII-only, since upper-casing is not used for name matching.

This is misleading to users. We should probably provide two pairs of functions like ToLowerAscii/ToUpperAscii and ToLowerUtf8/ToUpperUtf8.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASCII is a subset of UTF-8, so the UTF-8 ToLower is already correct on ASCII input. A ToLowerAscii wouldn't be more correct, only faster on known-ASCII data, and it makes the caller decide "is my input ASCII?", which is sort of a foot-gun. One ToLower also matches Iceberg Java, which only has toLowerCase(Locale.ROOT) with no ascii/utf8 split.

I suggest that we skip ToUpperUtf8 either way. Simple-mapping uppercase is just wrong for things like ß (stays ß, should be SS), and nothing here uppercases non-ASCII anyway. The two ToUpper callers just normalize codec/mode strings to match GZIP/ALL.

You're right it's misleading though. I'll rename ToUpper -> ToUpperAscii so it's in the name, and keep ToLower as the Unicode one. WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern on this is the naming. I don't think it is good if one is ToXXXAscii but the other is ToXXXUtf8. It would be good to just use ToUpper and ToLower when we only have a single pair and describe their behavior clearly in the comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wgtmac I agree. What you are suggesting will be least confusing. I will make comment more descriptive along with changing the names of the functions


/// \brief Lower-case a UTF-8 string using Unicode simple case mapping.
///
/// Mirrors Iceberg Java's case-insensitive handling, which lower-cases names with

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not say this is Mirrors Iceberg Java's case-insensitive handling since we are not a 100% exact matching. If we cannot guarantee that, we need to clarify where we might diverge in the comment.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, will post the differences

static bool EqualsIgnoreCase(std::string_view lhs, std::string_view rhs) {
return std::ranges::equal(
lhs, rhs, [](char lc, char rc) { return ToLowerAscii(lc) == ToLowerAscii(rc); });
return ToLower(lhs) == ToLower(rhs);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This incur performance regression (if they are ASCII) because we cannot return early if their prefixes do not equal. Perhaps we can still implement this in a streaming approach?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Checked the callers, they're all short ASCII on cold paths (metrics mode, "true", a couple UUIDs, a header name), nothing per row. Short strings stay in SSO so ToLower doesn't allocate, and the UUID compare runs once per commit, so I believe there isn't a measurable regression here.

Iceberg Java does the same thing: case-insensitive lookup just lowercases the query and hits a map
(Schema.caseInsensitiveFindField), and that map is built by lowercasing every field name once (TypeUtil.indexByLowerCaseName). No streaming equalsIgnoreCase anywhere; the speed comes from caching the lowercased keys, which we already do in schema.cc/type.cc. So the hot path is covered and EqualsIgnoreCase is just the cold fallback.

Streaming is also trickier than it looks since you can't early-out on length: simple lowercasing can change byte length (İi goes 2 bytes to 1).

I suggest we keep this as is and revisit if a hot-path caller comes up.

@goel-skd

Copy link
Copy Markdown
Contributor Author

I haven't checked the implementation and test yet. Just post some thoughts around APIs.

Thanks much @wgtmac. I responded to your comments.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

case insensitive field matching behavior different from iceberg-python and iceberg-java

2 participants