[OPIK-1833]: [P SDK] Implement evaluation engine for threads by yaricom · Pull Request #2514 · comet-ml/opik · GitHub

[OPIK-1833]: [P SDK] Implement evaluation engine for threads #2514


Status: Draft. Wants to merge 26 commits into base `main`.

Conversation

@yaricom (Member) commented Jun 18, 2025

Details

The goal is to build something similar to `opik.evaluate`, but for evaluating whole conversations (threads) on the SDK side.

User flow (once the threads already exist in the project):

  1. Instantiate conversational metrics in the SDK
  2. Call opik.evaluate_threads with these metrics and the project name where threads are stored. Under the hood it will:
    a. Download the threads data from the backend to SDK
    b. Convert every thread (a list of traces) into a discussion-like format that is easy to analyze (see the sketch after this list)
    c. Calculate metrics for all the threads (optionally skipping a metric that has already been computed)
    d. Log the calculated metric result as a feedback score
    e. Save the calculated metric result in a report object
    f. Return the final report object to the user.
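
For illustration only, the discussion-like shape produced in step 2b could look roughly like the sketch below. The `thread_to_conversation` helper, the role names, and the trace attribute names are assumptions for this sketch, not the PR's actual conversion logic; the real conversion relies on the `trace_input_transform` / `trace_output_transform` callables shown in the API example that follows.

from typing import Dict, List

# A conversation is a flat list of role/content messages, one pair per trace.
ConversationItem = Dict[str, str]
Conversation = List[ConversationItem]

def thread_to_conversation(traces, input_transform, output_transform) -> Conversation:
    # traces: the thread's traces (hypothetical objects with .input / .output payloads)
    conversation: Conversation = []
    for trace in traces:
        conversation.append({"role": "user", "content": input_transform(trace.input)})
        conversation.append({"role": "assistant", "content": output_transform(trace.output)})
    return conversation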

Python SDK API example

Since we are evaluating existing threads, there is no dataset involved.

from typing import Callable, List, Optional

def evaluate_threads(
    project_name: str,
    filter_string: Optional[str],
    thread_ids: Optional[List[str]],
    eval_project_name: Optional[str],
    metrics: List[ConversationThreadMetric],
    trace_input_transform: Callable[[JsonListStringPublic], str],
    trace_output_transform: Callable[[JsonListStringPublic], str],
    num_workers: int = 8,
) -> ThreadsEvaluationResult:
    # Returns an object containing the information about every thread
    # and the feedback scores it got.
    ...
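
A minimal usage sketch, assuming the threads already live in a project named "chatbot-project" and that `my_metric` is an instance of some `ConversationThreadMetric` subclass; the project name, the metric instance, and the identity-style transforms are illustrative only, not part of this PR.

import opik

result = opik.evaluate_threads(
    project_name="chatbot-project",            # project where the threads are stored
    filter_string=None,                        # no filtering: evaluate every thread
    thread_ids=None,
    eval_project_name=None,                    # log feedback scores back to the same project
    metrics=[my_metric],                       # pre-instantiated ConversationThreadMetric(s)
    trace_input_transform=lambda x: str(x),    # turn a trace input payload into plain text
    trace_output_transform=lambda x: str(x),   # turn a trace output payload into plain text
    num_workers=8,
)

# result is a ThreadsEvaluationResult describing every evaluated thread
# and the feedback scores it received.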

Testing

Implemented related test cases.

Documentation

Provided docstring documentation.

yaricom added 15 commits June 18, 2025 20:02
…ng support

- Added `ThreadsEvaluationEngine` for evaluating threads and logging feedback scores.
- Integrated `ThreadsEvaluationResult` model for structured evaluation outputs.
- Developed `execute` function with ThreadPoolExecutor for parallel task processing (a minimal sketch of this pattern follows below).
- Added types and utilities to support thread evaluation tasks.
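
For orientation, a minimal sketch of the ThreadPoolExecutor pattern mentioned above; the `EvaluationTask` alias and the `execute` signature are assumptions for this sketch, not the PR's actual code.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, TypeVar

T = TypeVar("T")

# An evaluation task is a zero-argument callable producing a single result.
EvaluationTask = Callable[[], T]

def execute(tasks: List[EvaluationTask[T]], workers: int = 8) -> List[T]:
    # Submit every task to a thread pool and collect results in submission order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(task) for task in tasks]
        return [future.result() for future in futures]
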
…pt an instance of `threads_client.ThreadsClient` to allow easy mocking during testing.
…ypes` module

Streamlined `Conversation` and related type imports by using the newly centralized `types` module. Updated method signatures and replaced inline type definitions for better maintainability.
…nEngine`

- Added a logging framework for better traceability.
- Introduced specific exceptions to handle empty metrics and thread data scenarios.
- Improved conversation trace validation to skip evaluation for threads without conversation traces.
- Enhanced span and trace outputs to include detailed evaluation results.
- Implemented comprehensive test coverage for `ThreadsEvaluationEngine` methods.
- Added tests for conversation trace retrieval, feedback score logging, and thread evaluation.
- Verified concurrency behavior with multiple threads and workers.
- Ensured proper exception handling and logging for edge cases.
…py` exports

- Implemented `evaluate_threads` to enable conversation thread evaluation with custom metrics.
- Updated `__all__` in `__init__.py` for proper module export of the new function.
@yaricom yaricom marked this pull request as ready for review June 23, 2025 12:22
@yaricom yaricom requested a review from a team as a code owner June 23, 2025 12:22
yaricom added 11 commits June 23, 2025 19:07
… in thread evaluation logic

- Introduced `EvaluationError` for better exception clarity.
- Updated exception handling in `ThreadsEvaluationEngine` and corresponding unit tests.
… updated logic to handle scoring failures

- Renamed `_get_conversation_tread` to `_get_conversation_thread` across implementation and tests.
- Enhanced scoring logic to exclude failed scores from feedback logging in `ThreadsEvaluationEngine` (a rough sketch of this filtering follows below).
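
As a rough illustration of the filtering described above; the `ScoreResult` dataclass here is a simplified stand-in with assumed fields, not the SDK's actual score result type.

import dataclasses
from typing import List, Optional

@dataclasses.dataclass
class ScoreResult:
    # Simplified stand-in for the SDK's score result type (assumed fields).
    name: str
    value: float
    reason: Optional[str] = None
    scoring_failed: bool = False

def successful_scores(scores: List[ScoreResult]) -> List[ScoreResult]:
    # Drop failed scores so they are never logged as feedback.
    return [score for score in scores if not score.scoring_failed]
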
…ontext management for evaluations

- Renamed `_evaluate_thread` to `evaluate_thread` for better clarity.
- Introduced `evaluate_llm_conversation_context` to manage context and trace data during evaluations.
- Moved trace and span logic into `context_helper` for cleaner code organization.
- Updated unit tests to reflect method renaming and added context mock handling.
- General improvement to exception handling and error logging mechanisms across evaluation flows.
…ngine framework

- Replaced `_types.ThreadTestResult` with `evaluation_result.ThreadEvaluationResult` for consistency.
- Merged evaluation logic into `evaluation_tasks_executor` from `evaluation_executor`.
- Updated thread evaluation tasks to use generic `EvaluationTask` with TypeVar to improve type safety.
- Simplified `ThreadsEvaluationResult` structure for clearer results handling.
- Refactored tests to reflect the updated structure and improve validation of evaluation results.
- Introduced `Literal` for type hinting `conversation` keys, improving clarity and type safety.
- Included comments to enhance code readability and maintainability.
- Extracted `_get_conversation_thread` and `_log_feedback_scores` into `helpers` module, improving modularity and reusability.
- Updated `ThreadsEvaluationEngine` to use new helper methods and streamlined logic.
- Added unit tests for `helpers.py` to ensure correctness of extracted functionality.
@yaricom yaricom marked this pull request as draft June 25, 2025 12:27
Labels: none yet
Projects: none yet
2 participants