[OPIK-1833]: [P SDK] Implement evaluation engine for threads #2514
Draft
yaricom wants to merge 26 commits into main from OPIK-1833-threads-evaluation-engine
Conversation
…arch and feedback logging
…ng support
- Added `ThreadsEvaluationEngine` for evaluating threads and logging feedback scores.
- Integrated `ThreadsEvaluationResult` model for structured evaluation outputs.
- Developed `execute` function with ThreadPoolExecutor for parallel task processing.
- Added types and utilities to support thread evaluation tasks.
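To make the concurrency model concrete, here is a minimal sketch of a `ThreadPoolExecutor`-based `execute` function in the spirit of this commit. The names and signatures below are assumptions for illustration, not the PR's actual code:

```python
import concurrent.futures
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Model an evaluation task as a zero-argument callable producing one
# result (a simplification of the PR's `EvaluationTask` type).
EvaluationTask = Callable[[], T]


def execute(tasks: List[EvaluationTask[T]], workers: int = 8) -> List[T]:
    """Run evaluation tasks on a thread pool, collecting results as they finish."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [f.result() for f in concurrent.futures.as_completed(futures)]
```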
…pt an instance of `threads_client.ThreadsClient` to allow easy mocking during testing.
…ypes` module
- Streamlined `Conversation` and related type imports by using the newly centralized `types` module.
- Updated method signatures and replaced inline type definitions for better maintainability.
…nEngine`
- Added a logging framework for better traceability.
- Introduced specific exceptions to handle empty metrics and thread data scenarios.
- Improved conversation trace validation to skip evaluation for threads without conversation traces.
- Enhanced span and trace outputs to include detailed evaluation results.
- Implemented comprehensive test coverage for `ThreadsEvaluationEngine` methods.
- Added tests for conversation trace retrieval, feedback score logging, and thread evaluation.
- Verified concurrency behavior with multiple threads and workers.
- Ensured proper exception handling and logging for edge cases.
…py` exports
- Implemented `evaluate_threads` to enable conversation thread evaluation with custom metrics.
- Updated `__all__` in `__init__.py` for proper module export of the new function.
alexkuzmik requested changes on Jun 23, 2025
sdks/python/tests/unit/evaluation/threads/test_evaluation_engine.py (outdated; resolved)
…pire_immediately` context manager
… in thread evaluation logic
- Introduced `EvaluationError` for better exception clarity.
- Updated exception handling in `ThreadsEvaluationEngine` and corresponding unit tests.
… updated logic to handle scoring failures
- Renamed `_get_conversation_tread` to `_get_conversation_thread` across implementation and tests.
- Enhanced scoring logic to exclude failed scores from feedback logging in `ThreadsEvaluationEngine`.
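A one-line sketch of the failure-filtering idea above, assuming each metric result carries a `scoring_failed` flag (the attribute name is an assumption):

```python
# Hypothetical: log feedback only for scores that computed successfully.
scores_to_log = [score for score in computed_scores if not score.scoring_failed]
```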
…ontext management for evaluations
- Renamed `_evaluate_thread` to `evaluate_thread` for better clarity.
- Introduced `evaluate_llm_conversation_context` to manage context and trace data during evaluations.
- Moved trace and span logic into `context_helper` for cleaner code organization.
- Updated unit tests to reflect the method renaming and added context mock handling.
- General improvements to exception handling and error logging across evaluation flows.
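As a rough illustration of the context-management pattern in this commit (a hypothetical sketch only; the real `evaluate_llm_conversation_context` lives in the PR's `context_helper` module and its signature may differ):

```python
import contextlib
from typing import Any, Dict, Iterator


@contextlib.contextmanager
def evaluate_llm_conversation_context(
    thread_id: str, conversation: Dict[str, Any]
) -> Iterator[Dict[str, Any]]:
    # Open a trace-like context so metric scores computed inside the
    # `with` block can be attached to this thread's evaluation.
    trace_data: Dict[str, Any] = {"thread_id": thread_id, "input": conversation}
    try:
        yield trace_data
    except Exception:
        # Record the failure before re-raising, mirroring the PR's
        # improved error logging across evaluation flows.
        trace_data["error"] = True
        raise
```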
…ngine framework
- Replaced `_types.ThreadTestResult` with `evaluation_result.ThreadEvaluationResult` for consistency.
- Merged evaluation logic into `evaluation_tasks_executor` from `evaluation_executor`.
- Updated thread evaluation tasks to use a generic `EvaluationTask` with a TypeVar to improve type safety.
- Simplified `ThreadsEvaluationResult` structure for clearer results handling.
- Refactored tests to reflect the updated structure and improve validation of evaluation results.
- Introduced `Literal` for type hinting `conversation` keys, improving clarity and type safety.
- Included comments to enhance code readability and maintainability.
- Extracted `_get_conversation_thread` and `_log_feedback_scores` into a `helpers` module, improving modularity and reusability.
- Updated `ThreadsEvaluationEngine` to use the new helper methods and streamlined its logic.
- Added unit tests for `helpers.py` to ensure correctness of the extracted functionality.
Details
The goal is to build something similar to `opik.evaluate`, but for evaluating whole conversations on the SDK side.

User flow (once the threads already exist in the project): the user calls `opik.evaluate_threads` with the desired metrics and the name of the project where the threads are stored. Under the hood it will:
a. Download the threads data from the backend to the SDK
b. Convert every thread (i.e. a list of traces) into a discussion-like format that is easy to analyze
c. Calculate metrics for all the threads (optionally skipping a metric if it is already computed)
d. Log each calculated metric result as a feedback score
e. Save the calculated metric results in a report object
f. Return the final report object to the user
Python SDK API example
Since we are evaluating existing threads, no dataset is involved; a sketch of the intended call is shown below.
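Everything in this sketch is an assumption for illustration: the parameter names and the metric class shown here may not match the signature this PR finally merges.

```python
from opik.evaluation import evaluate_threads
from opik.evaluation.metrics import ConversationalCoherenceMetric

# Hypothetical usage: evaluate every existing thread in a project with
# conversation-level metrics. No dataset is needed, because the threads
# are downloaded from the backend.
results = evaluate_threads(
    project_name="my-chatbot-project",
    metrics=[ConversationalCoherenceMetric()],
)

# Per the flow above, the engine logs each metric result as a feedback
# score and returns an aggregate report object.
print(results)
```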
Testing
Implemented unit tests covering the evaluation engine, helper functions, and concurrency behavior.
Documentation
Provided docstring documentation.