feat: tool calling custom interfaces tasks extension #636

jmatejcz · 2025-06-17T14:08:55Z

Purpose

Add new custom interfaces tasks to the tool calling benchmark, and test and adjust the task prompts and the system prompt.
Currently, custom interfaces tasks revolve around checking the interface of a given topic/service and then publishing to it or calling it once. Also, the majority of them aren't predefined and ready to import.
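
To illustrate the "check the interface, then publish/call once" pattern described above, here is a minimal sketch of an ordering check. It is not the benchmark's validator code: `ToolCall`, `follows_check_then_publish`, the tool names `get_interface` and `publish`, and the topic `/demo_topic` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ToolCall:
    # Hypothetical record of one tool call made by the agent.
    name: str
    args: dict[str, Any] = field(default_factory=dict)


def follows_check_then_publish(
    calls: list[ToolCall], topic: str, msg_type: str
) -> bool:
    """Return True if the interface of msg_type was looked up
    before the first publish to topic."""
    looked_up = False
    for call in calls:
        if call.name == "get_interface" and call.args.get("msg_type") == msg_type:
            looked_up = True
        elif call.name == "publish" and call.args.get("topic") == topic:
            # Decide at the first publish to the topic.
            return looked_up
    return False  # the agent never published to the topic


calls = [
    ToolCall("get_interface", {"msg_type": "example_interfaces/msg/String"}),
    ToolCall("publish", {"topic": "/demo_topic", "message": {"data": "hi"}}),
]
print(follows_check_then_publish(calls, "/demo_topic", "example_interfaces/msg/String"))
```

The hard tasks added in this PR would chain several such steps across multiple services/topics.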

Proposed Changes

Changed prompts in existing tasks
Added 3 hard tasks that require calling multiple services/topics
Total of 12 different tasks
Predefined total of 18 tasks
Added tests for predefined tasks
Refactored timeout to depend on number of required calls
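
As a rough illustration of a timeout that depends on the number of required calls, the sketch below scales it linearly. The function name `task_timeout_seconds` and the `base` and `per_call` values are assumptions made for this example, not the values or formula used in the PR.

```python
def task_timeout_seconds(
    required_calls: int, base: float = 10.0, per_call: float = 5.0
) -> float:
    """Hypothetical linear rule: a fixed base plus a budget per required tool call."""
    if required_calls < 0:
        raise ValueError("required_calls must be non-negative")
    return base + per_call * required_calls


# A task with 2 required calls gets less time than one with 6.
print(task_timeout_seconds(2))
print(task_timeout_seconds(6))
```

With a rule like this, tasks that require many calls are not cut off by a timeout tuned for single-call tasks.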

Issues

Partially define final tasks in tool calling benchmark #576

Testing

Select only the custom interfaces tasks. I recommend:

    model_names = ["gpt-4o-mini"]
    vendors = ["openai"]

    # Define benchmarks that will be used
    mani_conf = ManipulationO3DEBenchmarkConfig(
        o3de_config_path="src/rai_bench/rai_bench/manipulation_o3de/predefined/configs/o3de_config.yaml",  # path to your o3de config
        levels=[  # define what difficulty of tasks to include in benchmark
            "trivial",
        ],
        repeats=1,  # how many times to repeat
    )
    tool_conf = ToolCallingAgentBenchmarkConfig(
        extra_tool_calls=[5],  # how many extra tool calls allowed to still pass
        task_types=[  # what types of tasks to include
            # "basic",
            # "spatial_reasoning",
            # "navigation",
            "custom_interfaces",
            # "manipulation",
        ],
        complexities=["easy"],
        N_shots=[2],  # examples in system prompt
        prompt_detail=["descriptive"],  # how descriptive should task prompt be
        repeats=1,
    )

then run:

python src/rai_bench/rai_bench/examples/benchmarking_models.py

View results, check if everything works
Run tests

pytest tests/rai_bench/tool_calling_agent

Check whether the task prompts in rai_bench/tool_calling_agent/tasks/basic.py make sense to you
Check whether the subtasks in rai_bench/tool_calling_agent/predefined/tasks_tasks.py make sense to you

jmatejcz · 2025-07-04T12:29:55Z

@boczekbartek
I've tried to align these changes with what we discussed for the basic tasks. To make default validators possible, I reduced the number of required parameters and variables, but these custom interfaces tasks still have an especially large number of constants, variables and parameters, as they require filling in interfaces, which can be long.

I didn't change the logic in these last commits, but I moved a lot of code, so please let me know whether this is reasonably clear and sensible now, because I'm a little bit dizzy with all these changes haha

These commits also include some changes to the basic tasks, as I removed or moved some redundant code that I missed in the last PR.

jmatejcz force-pushed the jm/feat/custom-interfaces-tasks branch from 9326585 to a70250c Compare June 17, 2025 14:20

jmatejcz mentioned this pull request Jul 1, 2025

feat: tool calling benchmark unified across types and prompts variety #620

Merged

jmatejcz force-pushed the jm/feat/basic-tasks branch from 8fa85fd to f84f5b6 Compare July 1, 2025 10:42

jmatejcz changed the base branch from jm/feat/basic-tasks to jm/feat/basic-tasks-extension July 1, 2025 13:20

Base automatically changed from jm/feat/basic-tasks-extension to development July 3, 2025 07:17

jmatejcz added 13 commits July 3, 2025 10:22

feat: redesigned custom interfaces tasks

8efed6f

refactor: corrected custom mocked interfaces

87e6244

chore: add todo

25691e9

feat: added hard tasks

d5712ab

feat: defined hard tasks

14546f3

using const variables for defining

style: removed unnecessery comments

dbb648d

fix: different validators for different variants of Tasks

30fcd24

tests: added tests for predefined custom interfaces tasks

0fb6200

refactor: adjust example in system prompt custom interfaces

7f579e2

feat: flexible timeout

47306f3

fix: default value to service args in mocked tool

9c02324

chore: typo in logging

a36835a

add missing licenses

style: removed redundant comments

d996e02

jmatejcz force-pushed the jm/feat/custom-interfaces-tasks branch from 89293c0 to d996e02 Compare July 3, 2025 08:23

jmatejcz added 5 commits July 3, 2025 12:47

refactor: removed moderate level of prompt detail

7ffbf9c

adjusted prompts

refactor: reduced number of constants

d3d64c3

added default valdiators for some tasks

fix: removed redundant code from basic tasks

dacd3ba

refactor: reduced number of parameters in tasks and added default validators

b76d371

tests: adjusted tests to changes

607ed04

jmatejcz marked this pull request as ready for review July 4, 2025 12:29

jmatejcz requested a review from boczekbartek July 4, 2025 12:29
