fix(eval): fix search_tool correctness always scoring 0%#1675
Conversation
_args_match checked emit_widget (renamed to user_requested_search in the tool schema) and limit (optional, server-side default). Both mismatched on every real model call, so tool_correctness was always False regardless of whether the model used the right keywords and object types. Fix: evaluate only keywords (case-insensitive) and object_types — the two fields that actually determine whether the search was semantically correct.
📝 WalkthroughWalkthrough
Search Tool Argument Matching
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`:
- Around line 12-16: The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 1e3e175f-7f97-436d-9802-bd0a292bd4bb
📒 Files selected for processing (1)
packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py
| actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or [])) | ||
| expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or [])) | ||
| if actual_kw != expected_kw: | ||
| return False | ||
| if sorted(actual_args.get("object_types") or []) != sorted(expected_args.get("object_types") or []): | ||
| return False | ||
| if actual_args.get("limit") != expected_args.get("limit"): | ||
| return False | ||
| return actual_args.get("emit_widget") == expected_args.get("emit_widget") | ||
| return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or []) |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Harden _args_match against malformed JSON argument types.
parsed_arguments() returns raw model-emitted JSON, so a bad tool call like {"keywords":[1]} or mixed object_types will raise here (.lower() / sorted(...)) and abort evaluation instead of producing tool_correctness=False. Treat non-string entries as invalid input and normalize defensively.
Proposed fix
def _args_match(actual_args: dict, expected_args: dict) -> bool:
# Only keywords and object_types determine semantic correctness.
# limit is optional with a server-side default; emit_widget was renamed to
# user_requested_search in the tool schema — neither affects search quality.
- actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or []))
- expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or []))
+ def _normalize_str_list(value: object, *, lowercase: bool = False) -> list[str]:
+ if not isinstance(value, list):
+ return []
+ items = [item for item in value if isinstance(item, str)]
+ return sorted(item.lower() if lowercase else item for item in items)
+
+ actual_kw = _normalize_str_list(actual_args.get("keywords"), lowercase=True)
+ expected_kw = _normalize_str_list(expected_args.get("keywords"), lowercase=True)
if actual_kw != expected_kw:
return False
- return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or [])
+ return _normalize_str_list(actual_args.get("object_types")) == _normalize_str_list(
+ expected_args.get("object_types")
+ )📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| actual_kw = sorted(k.lower() for k in (actual_args.get("keywords") or [])) | |
| expected_kw = sorted(k.lower() for k in (expected_args.get("keywords") or [])) | |
| if actual_kw != expected_kw: | |
| return False | |
| if sorted(actual_args.get("object_types") or []) != sorted(expected_args.get("object_types") or []): | |
| return False | |
| if actual_args.get("limit") != expected_args.get("limit"): | |
| return False | |
| return actual_args.get("emit_widget") == expected_args.get("emit_widget") | |
| return sorted(actual_args.get("object_types") or []) == sorted(expected_args.get("object_types") or []) | |
| def _normalize_str_list(value: object, *, lowercase: bool = False) -> list[str]: | |
| if not isinstance(value, list): | |
| return [] | |
| items = [item for item in value if isinstance(item, str)] | |
| return sorted(item.lower() if lowercase else item for item in items) | |
| actual_kw = _normalize_str_list(actual_args.get("keywords"), lowercase=True) | |
| expected_kw = _normalize_str_list(expected_args.get("keywords"), lowercase=True) | |
| if actual_kw != expected_kw: | |
| return False | |
| return _normalize_str_list(actual_args.get("object_types")) == _normalize_str_list( | |
| expected_args.get("object_types") | |
| ) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@packages/gooddata-eval/src/gooddata_eval/core/evaluators/search_tool.py`
around lines 12 - 16, The _args_match comparison is not defensive enough and can
crash on malformed tool arguments from parsed_arguments(). Update _args_match to
validate and normalize actual_args["keywords"] and actual_args["object_types"]
before lowercasing or sorting, treating any non-string or unexpected entry as a
mismatch instead of raising. Keep the existing matching behavior for valid
inputs, but ensure bad JSON like mixed types or numeric keywords returns False
rather than aborting evaluation.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## master #1675 +/- ##
=======================================
Coverage 79.21% 79.21%
=======================================
Files 232 232
Lines 15809 15809
=======================================
Hits 12523 12523
Misses 3286 3286 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Summary
The
tool_correctnessmetric insearch_toolevaluations was alwaysFalse(0%), making the dashboard metric useless.Root cause:
_args_matchchecked two fields that never matched real model calls:emit_widget— this parameter was renamed touser_requested_searchin the tool schema; the model never sendsemit_widget, soactual_args.get("emit_widget")was alwaysNonevs the expectedfalselimit— declared as optional (int | None) in the schema; the model omits it and relies on the server default, while fixtures hardcode10Fix: Only check
keywords(case-insensitive) andobject_types— the two fields that determine whether the search was semantically correct.limitand the widget display flag are not part of search correctness.Test Plan
uv run pytest tests/ -x -qinpackages/gooddata-evaltool_correctnessnow reflects actual model behaviour (models that call with the right keywords and object types should score True)Risk
Low — single function change in the evaluator, no effect on eval runner or production code.
Summary by CodeRabbit