A Python library for program synthesis and symbolic execution that combines Z3's constraint solving with LLM-guided synthesis. Put holes in your Python code and let holey
fill them using formal constraints, natural language specifications, or both.
The symbolic execution is
inspired by Philip Zucker's blog post "Symbolic Execution by Overloading __bool__
",
but explores all branches exhaustively instead of randomly and fleshes out the concepts towards solving Python Programming Puzzles.
The solver incorporates heuristics from LLMs in addition to symbolic execution.
python
with support forpip
(e.g.conda
), tested with Python 3.12z3
orcvc5
or both -- on mac with Homebrew, can install withbrew install z3 cvc5
git clone --recursive https://github.com/namin/holey.git
conda create -n holey python=3.12
conda activate holey
pip install -e ".[test,ollama,anthropic]"
For each LLM you want to use, provide an LLM API key, even if only a dummy one is needed. Only provide a key if you want to use that particular LLM provider. All provided keys will be used in parallel to generate a matrix of successes per LLM provider.
export OLLAMA_API_KEY=ollama
export ANTHROPIC_API_KEY=...
export GEMINI_API_KEY=...
export OPENAI_API_KEY=...
python puzzle_solver.py --help
python puzzle_solver.py --name-prefix HelloWorld:0
python puzzle_solver.py >results.txt 2>&
python puzzle_solver.py --name-prefix ListIn:1 --llm
The symbolic execution currently solves:
- 62% (222 out of 360) of
int
puzzles, - 33% (118 out of 363) of
str
puzzles, - 18% (9 out of 51) of
float
puzzles, - 45% (349 out of 774) overall.
with the following errors:
- 9 timeouts after 3 seconds at staging time (while generating the SMTLIB program)
- 118 errors at at staging time
- 175 SMTLIB programs returning
sat
but the originalsat
function failing on synthesized model input, - 181 SMTLIB programs returning non-
sat
(e.g.unsat
,unknown
or timing out after 2 seconds timeouts after staging (while building the SMTLIB program), errors during staging time, the SMTLIB - 941 (out of 1715) puzzles not yet even attempted because their type is not
int
orstr
, such asfloat
,list
(of various specialization), etc.
- 123 smaller problems tried
- 11 successes on smaller problem
- 9 successful extrapolations
Study_1:0 PandigitalSquare:0 CircularShiftNum:2 WeirdDecodeVowels:0 TripleDouble:0 MaxDelta:0 MinConsecutiveSum:2 MaxConsecutiveSum:0 BirthdayParadox:0 BirthdayParadox:1 Tutorial5:0
Study_1:0 PandigitalSquare:0 WeirdDecodeVowels:0 TripleDouble:0 MaxDelta:0 MinConsecutiveSum:2 MaxConsecutiveSum:0 BirthdayParadox:0 Tutorial5:0
- claude (extrapolate) 8 1 1 0 1 1 0 1 1 1 0 1
- claude (end-to-end) 5 1 1 0 1 0 0 0 0 1 1 0
- claude (SMTLIB) 0 0 0 0 0 0 0 0 0 0 0 0
- gemini (extrapolate) 5 0 1 0 1 1 1 1 0 0 0 0
- gemini (end-to-end) 5 0 0 0 0 1 1 0 1 1 0 1
- gemini (SMTLIB) 0 0 0 0 0 0 0 0 0 0 0 0
- ollama (extrapolate) 2 0 0 0 0 0 0 1 0 0 0 1
- ollama (end-to-end) 2 0 0 0 0 0 0 0 0 1 0 1
- ollama (SMTLIB) 0 0 0 0 0 0 0 0 0 0 0 0
.
├──
README.md
├──
benchmarks
│ └──
PythonProgrammingPuzzles
benchmark added as git
submodule
├──
holey
│ ├──
__init__.py
│ ├──
backend.py
backend to SMTLIB batch processes
│ ├──
core.py
includes tracer, symbolic classes, ...
│ ├──
llm.py
support for LLM generation and code extraction
│ └──
preprocessor.py
includes node transformer and sat driver
├──
log
│ └──
results.txt
example run
├──
puzzle_solver.py
main routine for benchmark solver
├──
pyproject.toml
└──
tests
└──
test_core.py
ran with python -m pytest
, basic and LLM-generated
I need help in completely fleshing out the symbolic executor as well as designing and implementing LLM-based heuristics to complement it. See the contributing guidelines, in particular discussing a workflow to find and fix issues driven by the benchmarks.