GameBench is an innovative new project that benchmarks Large Language Models (LLMs) using a diverse set of games. Why? Benchmarking via games has several advantages over traditional knowledge-based benchmarks such as MMLU, GPQA, and HumanEval:
- Far Wider Skill Range: Game benchmarks have both a lower skill floor and a higher skill ceiling than traditional knowledge assessments. Standard benchmarks often saturate quickly, with frontier models jumping from low initial scores (5-20%) to near-peak performance (80-90%), which erodes their discriminative power. Games, by contrast, keep distinguishing nuanced performance differences across multiple model generations.
- Resistance to Direct Optimization: Ironically, game benchmarks are the least gameable. Unlike knowledge-based tests, which models can easily be fine-tuned on, improving scores on game benchmarks requires building sophisticated reinforcement learning environments spanning many varied games.
- Robust Against Data Contamination: Similarly, game benchmarks are largely contamination-proof. The strategic knowledge required to play these games is typically already present in LLM training data, so direct memorization confers little advantage.
- High Extensibility and Customizability: Compared to traditional benchmarks, game benchmarks can easily be made harder (or easier) through simple extensions. For example, one could test adaptability by having two models play each other repeatedly while carrying over context, to see whether they can adapt to and exploit weaknesses in each other's gameplay. Or one could take a game that supports multiple players and increase (or decrease) the number of players, requiring models to handle a more (or less) complex game state.
- Economical Benchmark Development: As frontier models become increasingly capable, new challenging knowledge-based benchmarks can only be authored by the world's leading experts, making every new question extremely resource-intensive to produce. Game benchmarks, by contrast, can be extended with new games or rule variations at far lower cost.
Liar's Poker is the first game of choice. Adapted from the famous game described by Michael Lewis, it is a simple game of statistical reasoning with bluffing. Each player is dealt a hidden 8-digit “hand” (random digits 0–9). Players take turns bidding on how many total instances of a specific digit (e.g., “five 7s”) they believe exist across all hands. Bids must increase in quantity and/or digit. On their turn, a player may raise the bid or issue a challenge—disputing the previous bid. If challenged, all hands are revealed: if the bid is valid, the bidder wins; if not, the challenger wins. The game tests bluffing, inference, and probabilistic reasoning.
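To make the mechanics concrete, here is a minimal Python sketch (not GameBench's actual implementation; all function names are illustrative) of how a challenge could be resolved and how a player might estimate whether a bid is plausible given only their own hand.

```python
import random
from math import comb

HAND_SIZE = 8    # digits per hidden hand
NUM_DIGITS = 10  # digits 0-9

def deal_hands(num_players: int) -> list[list[int]]:
    """Deal each player a hidden hand of random digits."""
    return [[random.randint(0, 9) for _ in range(HAND_SIZE)] for _ in range(num_players)]

def resolve_challenge(hands: list[list[int]], bid_quantity: int, bid_digit: int) -> bool:
    """Return True if the challenged bid holds, i.e. the digit appears
    at least `bid_quantity` times across all hands combined."""
    total = sum(hand.count(bid_digit) for hand in hands)
    return total >= bid_quantity

def prob_bid_true(num_unknown_digits: int, bid_quantity: int, known_count: int = 0) -> float:
    """Probability that a bid is valid from one player's perspective:
    the player sees `known_count` copies of the digit in their own hand,
    and each of the `num_unknown_digits` hidden digits held by other
    players is that digit with probability 1/10 (binomial tail)."""
    needed = max(bid_quantity - known_count, 0)
    p = 1 / NUM_DIGITS
    return sum(
        comb(num_unknown_digits, k) * p**k * (1 - p) ** (num_unknown_digits - k)
        for k in range(needed, num_unknown_digits + 1)
    )

# Example: in a 2-player game I hold two 7s; how plausible is a bid of "five 7s"?
# The opponent holds 8 unknown digits, so at least three more 7s are needed.
print(prob_bid_true(num_unknown_digits=8, bid_quantity=5, known_count=2))

hands = deal_hands(2)
print(resolve_challenge(hands, bid_quantity=5, bid_digit=7))
```

The binomial tail is the core probabilistic reasoning the game rewards: a player knows only their own eight digits and must infer the likely total count from everyone's bids and bluffs.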
Models are evaluated using the Microsoft TrueSkill algorithm.
The TrueSkill Leaderboard reports two metrics for each model:
- μ: Skill Rating, quantifying how well the model performs.
- σ: Uncertainty Rating, quantifying how uncertain we are about that skill estimate.
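As a reference for how μ and σ get updated after a game, here is a minimal sketch assuming the open-source `trueskill` Python package; the model names and match outcome below are hypothetical, not actual leaderboard data.

```python
import trueskill

# Liar's Poker games end with a single winner, so draws are disabled.
env = trueskill.TrueSkill(draw_probability=0.0)

# One Rating object per model; mu is the skill estimate, sigma the uncertainty.
ratings = {name: env.create_rating() for name in ("model_a", "model_b", "model_c")}

# Suppose model_b won a 3-player game, model_a came second, and model_c last.
# Multiplayer games are rated as a free-for-all: one rating group per model,
# with `ranks` giving each group's finishing position (0 = best).
new_groups = env.rate(
    [(ratings["model_a"],), (ratings["model_b"],), (ratings["model_c"],)],
    ranks=[1, 0, 2],
)

for (new_rating,), name in zip(new_groups, ("model_a", "model_b", "model_c")):
    ratings[name] = new_rating
    print(f"{name}: mu={new_rating.mu:.2f}, sigma={new_rating.sigma:.2f}")
```

Repeating this update over many games drives σ down as the ranking becomes more certain, while μ converges toward each model's true playing strength.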
The inspiration for using TrueSkill and for the scoreboard design comes from lechmazur, who has also tested LLMs on many games, such as NYT Connections and The Elimination Game.