test(max): Make evals reflect user flow by using root node #32751
PR Summary
This PR restructures the AI evaluation framework to test insight type selection by routing queries through the root node, better reflecting real user interaction flow.
- Added `call_root_for_insight_generation` fixture in `ee/hogai/eval/conftest.py` to route all insight evaluations through the root node
- Introduced `QueryKindSelection` scorer in `ee/hogai/eval/scorers.py` to verify correct insight type selection
- Added `add_query_creation_flow` method in `ee/hogai/graph/graph.py` to separate query creation from execution
- Refactored all insight evaluations (trends, funnels, retention, SQL) to use root node routing instead of direct node access
- Removed individual `call_node` fixtures from each insight evaluation file in favor of a unified root approach
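
To make the routing change concrete, here is a minimal sketch of the difference between calling an insight node directly and going through a root router. Only the fixture names `call_node` and `call_root_for_insight_generation` come from this PR; `InsightResult`, the node functions, the kind labels, and the keyword-based router are hypothetical stand-ins (the real root node is presumably an LLM-driven graph node, not a keyword match).

```python
from dataclasses import dataclass


@dataclass
class InsightResult:
    """Hypothetical result type, not taken from the PostHog codebase."""
    query_kind: str  # illustrative labels: "trends" or "funnel"
    plan: str


def trends_node(question: str) -> InsightResult:
    return InsightResult("trends", f"trends plan for: {question}")


def funnel_node(question: str) -> InsightResult:
    return InsightResult("funnel", f"funnel plan for: {question}")


def root_node(question: str) -> InsightResult:
    """Toy router (assumption): picks a node from a keyword in the question."""
    node = funnel_node if "conversion" in question.lower() else trends_node
    return node(question)


def call_node(question: str) -> InsightResult:
    """Old style: the funnel eval calls the funnel node directly,
    so the insight kind is preselected and never tested."""
    return funnel_node(question)


def call_root_for_insight_generation(question: str) -> InsightResult:
    """New style: one shared entry point, the router chooses the kind."""
    return root_node(question)
```

With the direct call, a trends-style question sent to the funnel eval still yields a funnel result, so a wrong kind can never surface. Through the root entry point, the chosen kind becomes part of the output and can be scored.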
8 file(s) reviewed, 5 comment(s)
18e1f68 to 821ded8
Won't this become superseded by the new planning architecture you're working on?
🧠 AI eval results
Evaluated 6 experiments, comprising 19 metrics.
- funnel: 🆕 QueryKindSelection: 100.00%. Avg. case performance: ⏱️ 91.16 s, 🔢 6338 tokens, 💵 $0.0166 in tokens
- memory: 🆕 ToolRelevance: 98.24%. Avg. case performance: ⏱️ 6.43 s, 🔢 1213 tokens, 💵 $0.0033 in tokens
- retention: 🆕 QueryKindSelection: 100.00%. Avg. case performance: ⏱️ 28.69 s, 🔢 5384 tokens, 💵 $0.0162 in tokens
- root: 🆕 ToolRelevance: 58.88%. Avg. case performance: ⏱️ 5.36 s, 🔢 0 tokens
- sql: 🆕 QueryKindSelection: 0.00%. Avg. case performance: ⏱️ 13.82 s, 🔢 16194 tokens, 💵 $0.0415 in tokens
- trends: 🆕 QueryKindSelection: 100.00%. Avg. case performance: ⏱️ 39.50 s, 🔢 11009 tokens, 💵 $0.0296 in tokens
Triggered by this commit.
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Had this on the back burner for a bit, but good to go now.
🧠 AI eval results
Evaluated 6 experiments, comprising 19 metrics.
- funnel: 🆕 QueryKindSelection: 100.00%. Avg. case performance: ⏱️ 93.25 s, 🔢 6209 tokens, 💵 $0.0163 in tokens
- memory: 🆕 ToolRelevance: 99.16%. Avg. case performance: ⏱️ 5.02 s, 🔢 1213 tokens, 💵 $0.0033 in tokens
- retention: 🆕 QueryKindSelection: 100.00%. Avg. case performance: ⏱️ 31.34 s, 🔢 4557 tokens, 💵 $0.0162 in tokens
- root: 🆕 ToolRelevance: 58.88%. Avg. case performance: ⏱️ 5.78 s, 🔢 0 tokens
- sql: 🆕 QueryKindSelection: 0.00%. Avg. case performance: ⏱️ 11.68 s, 🔢 12605 tokens, 💵 $0.0323 in tokens
- trends: 🆕 QueryKindSelection: 100.00%. Avg. case performance: ⏱️ 43.00 s, 🔢 9889 tokens, 💵 $0.0267 in tokens
Triggered by this commit.
Problem
We have dozens of AI evaluation cases, but they don't test insight type selection right now, as they all feed questions into an already-selected insight type flow.
Changes
Let's make insight creation evals evaluate reality, by going through the root node. This will allow us to actually see the failure modes around wrong insight types being chosen. Insight-kind-specific `call_node` fixtures are replaced by a common `call_root_for_insight_generation`.
We now also have a simple eval scorer for query kind selection, called… `QueryKindSelection`.
How did you test this code?
These are the tests.
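
For readers unfamiliar with this kind of check, a query kind scorer can be as simple as an exact match between the expected and the generated kind. The sketch below is an assumption about the general shape, not the actual `QueryKindSelection` implementation in this PR; the `Score` dataclass, the function names, and the kind labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Score:
    """Hypothetical score container, not the eval framework's own type."""
    name: str
    score: float
    metadata: dict = field(default_factory=dict)


def score_query_kind_selection(expected_kind: str, output_kind: Optional[str]) -> Score:
    """Return 1.0 when the generated query kind matches the expected one, else 0.0."""
    if output_kind is None:
        # No query was produced at all, so the selection cannot be correct.
        return Score("QueryKindSelection", 0.0, {"reason": "no query generated"})
    matched = output_kind == expected_kind
    return Score(
        "QueryKindSelection",
        1.0 if matched else 0.0,
        {"expected": expected_kind, "actual": output_kind},
    )


def average_score(scores: list[Score]) -> float:
    """Mean score across cases, as a fraction between 0 and 1."""
    return sum(s.score for s in scores) / len(scores) if scores else 0.0
```

Under this scheme a per-experiment percentage like the ones in the eval results would be the mean of per-case 0/1 scores, so a 0.00% figure would mean no case produced the expected kind.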