Collections of games to be run with the clemcore framework
Install the required dependencies to run all games:

```
pip install -r clembench/requirements.txt
```
This will also install the `clem` CLI tool. Its main functions are:
```
(myclem) clem list games     # list the games available for a run
(myclem) clem list backends  # list the backends available for a run
(myclem) clem list models    # list the models available for a run
(myclem) clem run -g <game> -m <model>  # runs the game benchmark; also scores
(myclem) clem transcribe     # translates interactions into HTML files
(myclem) clem score          # computes individual performance measures
(myclem) clem eval           # computes overall performance measures; requires scores
```
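For example, a typical evaluation pass might chain these commands; here `taboo` stands in for any packaged game and `model1` is a placeholder model name:

```bash
# Illustrative sequence (game and model names are placeholders):
clem run -g taboo -m model1   # play the game episodes (scores are computed as well)
clem transcribe               # render the recorded interactions as HTML
clem score                    # compute individual performance measures
clem eval                     # aggregate overall performance measures from the scores
```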
To add new custom models, populate the `model_registry.json` file with the required fields (a template is provided as `model_registry.json.template`).
To run your custom game, populate the `game_registry.json` file with the required fields and directory path (a template is provided as `game_registry.json.template`).
To use APIs (OpenAI, Anthropic, Google, Mistral, etc.), create a `key.json` file that includes the required fields for each backend. A template file (`key.json.template`) is provided. Copy the file into `<userhome>/.clemcore/` to make it generally available.
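The exact fields per backend are given in `key.json.template`; as a rough sketch (the field names below are illustrative, copy the real ones from the template), the file maps backend names to their credentials:

```json
{
  "openai": {"api_key": "sk-..."},
  "anthropic": {"api_key": "..."}
}
```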
We recommend creating a dedicated workspace directory for working with clembench; it contains the benchmark game directories and optional files. The `clem` CLI command operates relative to the current working directory, that is, the directory it is called from. The workspace directory serves as a convenient working directory.
Workspace directory contents may look like this:

```
(optional) key.json
(optional) game_registry.json
(optional) model_registry.json
(optional) custom_api.py
clembench/
```
The files have the following functions:

- `key.json`: Contains secrets for the remote API calls; if this file does not exist, then `clem` looks into `~/.clemcore/`.
- `game_registry.json`: Makes additional game specifications usable for the runs. The game specifications must at least contain the `game_name`, `game_path` and `players` attributes (see the sketch after this list).
- `model_registry.json`: Adds additional model specifications. This is specifically useful for running models that have not been packaged yet. In addition, it allows pointing model specifications to custom backend names.
- `custom_api.py`: `clem` automatically discovers additional `_api.py` files placed into the cwd, so that users of the framework can run their own backends with the games.
- `clembench/`: Contains the game directories (with the game code) available for the benchmark runs.
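For illustration, a `game_registry.json` entry with the required fields might look roughly like this (whether the file holds a single object or a list of specifications, and the exact value formats, should be taken from `game_registry.json.template`):

```json
[
  {
    "game_name": "mygame",
    "game_path": "path/to/mygame",
    "players": "two"
  }
]
```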
Note that `clem` automatically discovers game directories that are at most 3 levels away from the current working directory (cwd). To be discoverable, game directories have to carry a `clemgame.json` (here a game path is not required, because `clem` determines it automatically).
To prepare running multiple models on all games that constitute the Clembench benchmark, check out the clembench repository into a new workspace directory. To access remote API backends, add a `key.json` containing the respective API access keys to the workspace directory. In addition, you might need to add model entries that are not yet packaged to a `model_registry.json`.
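As a sketch, setting up such a workspace could look like this (the repository URL is an assumption; adapt paths and file contents to your setup):

```bash
# Hypothetical workspace preparation for benchmark runs
mkdir myworkspace && cd myworkspace
git clone https://github.com/clp-research/clembench.git   # assumed repository location
# create key.json (from key.json.template) and, if needed, model_registry.json here
```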
To run all available games on, for example, `model1`, execute `clem run -g all -m model1` in a terminal. The example `model1` is the key string for the model to be run in the model registry (either the packaged `clemcore/clemcore/backends/model_registry.json` or a custom `model_registry.json` in the workspace directory).
To run multiple models, we currently recommend using a batch script containing multiple `clem` CLI calls, one for each model.
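For instance, a minimal batch script could simply loop over the models (the model names below are placeholders):

```bash
#!/usr/bin/env bash
# Run all games for several models, one clem call per model.
for model in model1 model2 model3; do
  clem run -g all -m "$model"
done
```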
By default, result files will be stored in the `results` subdirectory of the current working directory. Results can be stored in a different directory by executing `clem run -g all -m model1 -r <other_directory>`, with `<other_directory>` being the path to the target directory.
Hence, a benchmarking workspace directory might look as follows:
```
myworkspace
- clembench/
- results/
- key.json
- model_registry.json
```
To test the performance of your own model on the benchmark, check out the clembench repository into your workspace directory. To make your model available to `clem`, create a `model_registry.json` with the specifications of your model in the working directory. The model registry entry must at least specify a name and a backend:
```json
{
  "model_name": "mymodel",
  "backend": "mybackend"
}
```
More information on the model registry is available in the model registry and backends readme.
If your model is not compatible with the packaged local backends (HuggingFace `transformers`, `llama-cpp-python`, `vLLM`), it requires a custom backend. In this case, create a `mybackend_api.py` in the workspace directory which implements the `generate_response` method for the model and might specify how it is loaded. All backend module files must be named `<backend name>_api.py`, with `<backend name>` being the backend to refer to in the model registry.
For more information on custom backends, see the adding models and backends howto.
`clem` tries to locate all non-packaged backend modules in the workspace directory. Therefore, your model's registry entry must match the backend module, with the model entry's `"backend"` key value matching the `<backend name>` of your custom `<backend name>_api.py` backend module file.
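As a very rough sketch of such a module (the class layout and signatures below are assumptions for illustration only; the actual base classes and the exact `generate_response` contract are defined by `clemcore.backends` and described in the howto above), a `mybackend_api.py` could be structured like this:

```python
# mybackend_api.py -- illustrative sketch only; not the actual clemcore.backends API.

class MyModel:
    """Wraps the custom model that the registry refers to as "mymodel"."""

    def __init__(self, model_spec):
        # model_spec mirrors the registry entry,
        # e.g. {"model_name": "mymodel", "backend": "mybackend"}
        self.model_spec = model_spec
        # load weights or open a connection to your model server here

    def generate_response(self, messages):
        # messages is assumed to be a chat-style list of
        # {"role": ..., "content": ...} dicts; check the howto for the exact
        # input/output contract expected by clemcore.
        prompt = "\n".join(m["content"] for m in messages)
        return f"(placeholder response to: {prompt})"
```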
Run `clem run -g all -m mymodel` from the workspace directory to run your model on all games. The results will be stored in the `results` subdirectory.
Hence, a model developer's workspace might look as follows:

```
myworkspace
- clembench/
- results/
- model_registry.json
- mybackend_api.py
```
The use case for training models on data generated by gameplay is available at the playpen repository (still under development).
To implement your own game to be run with `clem`, we recommend using a typical clem game project structure, with the game directory as your workspace directory. To make the game visible to `clem`, you need to add a `clemgame.json` to the directory. This file must specify at least the following (possible values are separated by `|`):
```
{
  "game_name": "mygame",
  "description": "A brief description of mygame",
  "player": "single" | "two" | "multi",
  "image": "none" | "single" | "multi",
  "languages": ["en"]
}
```
To test your game with a packaged model, run the command `clem run -g mygame -m model` from within the game directory. The results will be written into a `results` subdirectory. To generate HTML transcripts of your game run's episodes, run `clem transcribe -g mygame`.
To use remote API backends, add a `key.json` with your remote API access key(s) to the workspace directory.
Overall, a game developer's workspace directory will possibly look as follows:

```
mygame
- in/
- resources/
- results/
- __init__.py
- master.py
- instancegenerator.py
- clemgame.json
- key.json
```
For more information on creating and adding clemgames, see howto add games, howto add games example, howto prototype games and the logging and scoring docs.
We welcome you to contribute to or extend the benchmark with your own games and models. Please open a pull request in the respective repository. You can find more information on how to use the benchmark in the links below.
However, the following documentation still needs to be checked for up-to-dateness:
- How to run the benchmark and evaluation locally
- How to run the benchmark, update leaderboard workflow
- How to add a new model
- How to add and run your own game
The clembench release version tags are structured like `major.minor.patch`, where:

- `major` indicates compatibility with a major clemcore version, e.g. `2.x.x` is only compatible with clemcore versions `2.x.x`
- `minor` indicates changes to the games in the benchmark that don't affect compatibility with clemcore, e.g. refactorings, additions or removals of games
- `patch` indicates smaller adjustments or hotfixes to the games that don't affect compatibility with clemcore
The following image visualizes the established dependencies and version history: