[OPIK-1833]: [P SDK] Implement evaluation engine for threads #2514
Draft
yaricom wants to merge 26 commits into main from OPIK-1833-threads-evaluation-engine
Conversation
…arch and feedback logging
…ng support
- Added `ThreadsEvaluationEngine` for evaluating threads and logging feedback scores.
- Integrated `ThreadsEvaluationResult` model for structured evaluation outputs.
- Developed `execute` function with ThreadPoolExecutor for parallel task processing.
- Added types and utilities to support thread evaluation tasks.
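To make the concurrency model concrete, here is a minimal sketch of a `ThreadPoolExecutor`-based `execute` function in the spirit of this commit. The names and signatures below are assumptions for illustration, not the PR's actual code:

```python
import concurrent.futures
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Model an evaluation task as a zero-argument callable producing one
# result (a simplification of the PR's `EvaluationTask` type).
EvaluationTask = Callable[[], T]


def execute(tasks: List[EvaluationTask[T]], workers: int = 8) -> List[T]:
    """Run evaluation tasks on a thread pool, collecting results as they finish."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in concurrent.futures.as_completed(futures)]
```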
…pt an instance of `threads_client.ThreadsClient` to allow easy mocking during testing.
…ypes` module
- Streamlined `Conversation` and related type imports by using the newly centralized `types` module.
- Updated method signatures and replaced inline type definitions for better maintainability.
…nEngine`
- Added a logging framework for better traceability.
- Introduced specific exceptions to handle empty metrics and thread data scenarios.
- Improved conversation trace validation to skip evaluation for threads without conversation traces.
- Enhanced span and trace outputs to include detailed evaluation results.
- Implemented comprehensive test coverage for `ThreadsEvaluationEngine` methods.
- Added tests for conversation trace retrieval, feedback score logging, and thread evaluation.
- Verified concurrency behavior with multiple threads and workers.
- Ensured proper exception handling and logging for edge cases.
…py` exports
- Implemented `evaluate_threads` to enable conversation thread evaluation with custom metrics.
- Updated `__all__` in `__init__.py` for proper module export of the new function.
alexkuzmik requested changes on Jun 23, 2025
sdks/python/tests/unit/evaluation/threads/test_evaluation_engine.py (outdated; resolved)
…pire_immediately` context manager
… in thread evaluation logic
- Introduced `EvaluationError` for better exception clarity.
- Updated exception handling in `ThreadsEvaluationEngine` and corresponding unit tests.
… updated logic to handle scoring failures
- Renamed `_get_conversation_tread` to `_get_conversation_thread` across implementation and tests.
- Enhanced scoring logic to exclude failed scores from feedback logging in `ThreadsEvaluationEngine`.
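A one-line sketch of the failure-filtering idea above, assuming each metric result carries a `scoring_failed` flag (the attribute name is an assumption):

```python
# Hypothetical: log feedback only for scores that computed successfully.
scores_to_log = [score for score in computed_scores if not score.scoring_failed]
```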
…ontext management for evaluations
- Renamed `_evaluate_thread` to `evaluate_thread` for better clarity.
- Introduced `evaluate_llm_conversation_context` to manage context and trace data during evaluations.
- Moved trace and span logic into `context_helper` for cleaner code organization.
- Updated unit tests to reflect the method renaming and added context mock handling.
- General improvements to exception handling and error logging across evaluation flows.
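As a rough illustration of the context-management pattern in this commit (a hypothetical sketch only; the real `evaluate_llm_conversation_context` lives in the PR's `context_helper` module and its signature may differ):

```python
import contextlib
from typing import Any, Dict, Iterator


@contextlib.contextmanager
def evaluate_llm_conversation_context(
    thread_id: str, conversation: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    # Open a trace-like context so metric scores computed inside the
    # `with` block can be attached to this thread's evaluation.
    trace_data: Dict[str, Any] = {"thread_id": thread_id, "input": conversation}
    try:
        yield trace_data
    except Exception:
        # Record the failure before re-raising, mirroring the PR's
        # improved error logging across evaluation flows.
        trace_data["error"] = True
        raise
```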
…ngine framework
- Replaced `_types.ThreadTestResult` with `evaluation_result.ThreadEvaluationResult` for consistency.
- Merged evaluation logic into `evaluation_tasks_executor` from `evaluation_executor`.
- Updated thread evaluation tasks to use a generic `EvaluationTask` with a TypeVar to improve type safety.
- Simplified `ThreadsEvaluationResult` structure for clearer results handling.
- Refactored tests to reflect the updated structure and improve validation of evaluation results.
- Introduced `Literal` for type hinting `conversation` keys, improving clarity and type safety.
- Included comments to enhance code readability and maintainability.
- Extracted `_get_conversation_thread` and `_log_feedback_scores` into a `helpers` module, improving modularity and reusability.
- Updated `ThreadsEvaluationEngine` to use the new helper methods and streamlined its logic.
- Added unit tests for `helpers.py` to ensure correctness of the extracted functionality.
Details
The goal is to build something similar to `opik.evaluate`, but for evaluating whole conversations on the SDK side.

User flow (once the threads already exist in the project): the user calls `opik.evaluate_threads` with the desired metrics and the name of the project where the threads are stored. Under the hood it will:
a. Download the threads data from the backend to the SDK
b. Convert every thread (i.e. a list of traces) into a discussion-like format that is easy to analyze
c. Calculate metrics for all the threads (optionally skipping a metric if it is already computed)
d. Log each calculated metric result as a feedback score
e. Save the calculated metric results in a report object
f. Return the final report object to the user
Python SDK API example
Since we are evaluating existing threads, no dataset is involved; a sketch of the intended call is shown below.
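Everything in this sketch is an assumption for illustration: the parameter names and the metric class shown here may not match the signature this PR finally merges.

```python
from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import ConversationalCoherenceMetric

# Hypothetical usage: evaluate every existing thread in a project with
# conversation-level metrics. No dataset is needed, because the threads
# are downloaded from the backend.
results = evaluate_threads(
    project_name="my-chatbot-project",
    metrics=[ConversationalCoherenceMetric()],
)

# Per the flow above, the engine logs each metric result as a feedback
# score and returns an aggregate report object.
print(results)
```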
Testing
Implemented unit tests covering the evaluation engine, helper functions, and concurrency behavior.
Documentation
Provided docstring documentation.