-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Severe memory usage scaling with multiple ASOF joins in a single query #17001
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
Hi @pkhokhlov thanks for opening this issue and providing a detailed reproducer. Did you try DuckDB 1.2.1? Here are my results with 1.2.1, running in a dockerized environment (Ubuntu 24.04 on macOS, Apple Silicon): DuckDB ASOF Join Performance Benchmark
======================================
DuckDB Version: v1.2.1
=============================================
BENCHMARK WITH 2 JOINS
=============================================
Running benchmark with 2 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
Completed in 0.13 seconds
Peak memory increase: 131.55 MB
Running sequential benchmark...
Completed in 0.15 seconds
Peak memory increase: 5.88 MB
Results match!
Row-by-row comparison: Match
Hash-based comparison: Match
Performance comparison:
Runtime: Single query is 1.13x faster than sequential
Memory: Single query uses 22.39x more memory than sequential
==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.2.1
Configuration: 2 joins, 100000 base rows, 500000 price rows
Random Seed: 42
Single Query Approach:
Runtime: 0.13 seconds
Peak Memory Increase: 131.55 MB
Sequential Joins Approach:
Runtime: 0.15 seconds
Peak Memory Increase: 5.88 MB
Comparison (Single vs Sequential):
Runtime: Single query is 1.13x faster than sequential
Memory: Single query uses 22.39x more memory than sequential
Results from both approaches MATCH.
Row-by-row comparison successful
Hash-based comparison successful
=============================================
BENCHMARK WITH 4 JOINS
=============================================
Running benchmark with 4 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
Completed in 0.21 seconds
Peak memory increase: 462.39 MB
Running sequential benchmark...
Completed in 0.33 seconds
Peak memory increase: 2.62 MB
Results match!
Row-by-row comparison: Match
Hash-based comparison: Match
Performance comparison:
Runtime: Single query is 1.61x faster than sequential
Memory: Single query uses 176.15x more memory than sequential
==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.2.1
Configuration: 4 joins, 100000 base rows, 500000 price rows
Random Seed: 42
Single Query Approach:
Runtime: 0.21 seconds
Peak Memory Increase: 462.39 MB
Sequential Joins Approach:
Runtime: 0.33 seconds
Peak Memory Increase: 2.62 MB
Comparison (Single vs Sequential):
Runtime: Single query is 1.61x faster than sequential
Memory: Single query uses 176.15x more memory than sequential
Results from both approaches MATCH.
Row-by-row comparison successful
Hash-based comparison successful
=============================================
BENCHMARK WITH 6 JOINS
=============================================
Running benchmark with 6 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
Completed in 0.27 seconds
Peak memory increase: 819.13 MB
Running sequential benchmark...
Completed in 0.58 seconds
Peak memory increase: 37.89 MB
Results match!
Row-by-row comparison: Match
Hash-based comparison: Match
Performance comparison:
Runtime: Single query is 2.18x faster than sequential
Memory: Single query uses 21.62x more memory than sequential
==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.2.1
Configuration: 6 joins, 100000 base rows, 500000 price rows
Random Seed: 42
Single Query Approach:
Runtime: 0.27 seconds
Peak Memory Increase: 819.13 MB
Sequential Joins Approach:
Runtime: 0.58 seconds
Peak Memory Increase: 37.89 MB
Comparison (Single vs Sequential):
Runtime: Single query is 2.18x faster than sequential
Memory: Single query uses 21.62x more memory than sequential
Results from both approaches MATCH.
Row-by-row comparison successful
Hash-based comparison successful
Benchmark complete. Full results saved in benchmark_results/ |
@szarnyasg yes, I see the same behavior that you also reproduced above
|
Thanks for the report! This was something dumb (not releasing memory when we are done with it). Very similar to what we do in the window operator. With the fix, I measure peak memory usage with Instruments at about 100MB. |
* Free memory right after we are done with it fixes: #17001 fixes: duckdblabs/duckdb-internal#4586
Issue duckdb/duckdb#17001: AsOf memory Management (duckdb/duckdb#17028)
Issue duckdb/duckdb#17001: AsOf memory Management (duckdb/duckdb#17028)
Issue duckdb/duckdb#17001: AsOf memory Management (duckdb/duckdb#17028)
Issue duckdb/duckdb#17001: AsOf memory Management (duckdb/duckdb#17028)
Issue duckdb/duckdb#17001: AsOf memory Management (duckdb/duckdb#17028)
Uh oh!
There was an error while loading. Please reload this page.
What happens?
When executing multiple ASOF joins in a single query in DuckDB, memory usage increases significantly with the number of joins while sequential execution of individual ASOF joins maintains consistent low memory usage. The single query approach consumes 12-80x more memory as join count increases.
Here is an example of such joins followed by a benchmark illustrating the performance differences.
To Reproduce
I've created a benchmark script that compares two approaches for executing multiple ASOF joins:
The benchmark shows that memory usage for the single query approach grows significantly with the number of joins, while sequential joins maintain consistent memory usage.
Running
./run_benchmarks.sh
yields:Note the growth in memory usage for the single query approach:
While the sequential approach maintains consistent memory usage (~7-9 MB) regardless of join count.
This memory growth becomes particularly problematic with larger datasets, where I've observed that the single join approach causes spillover to disk, resulting in extreme performance degradation.
OS:
Linux
DuckDB Version:
1.1.1
DuckDB Client:
Python
Hardware:
No response
Full Name:
Pavel Khokhlov
Affiliation:
personal
What is the latest build you tested with? If possible, we recommend testing with the 8000 latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
The text was updated successfully, but these errors were encountered: