Severe memory usage scaling with multiple ASOF joins in a single query #17001

Closed
2 tasks done
pkhokhlov opened this issue Apr 5, 2025 · 3 comments · Fixed by #17028
Comments

pkhokhlov commented Apr 5, 2025

What happens?

When executing multiple ASOF joins in a single query, DuckDB's memory usage increases significantly with the number of joins, while executing the same ASOF joins sequentially as separate queries keeps memory usage consistently low. In the benchmarks below, the single-query approach consumes roughly 12-88x more memory as the join count grows.

Here is an example of such joins followed by a benchmark illustrating the performance differences.

CREATE TABLE result_single AS 
SELECT b.*, 
       p0.price AS price0, 
       p1.price AS price1
FROM base_single b
ASOF LEFT JOIN prices p0 
    ON p0.id = b.id AND p0.epochNs <= (b.epochNs + 1000000)
ASOF LEFT JOIN prices p1 
    ON p1.id = b.id AND p1.epochNs <= (b.epochNs + 2000000)

CREATE TABLE result_sequential AS
SELECT b.*, 
       p0.price AS price0
FROM base_original b
ASOF LEFT JOIN prices p0 
    ON p0.id = b.id AND p0.epochNs <= (b.epochNs + 1000000)

CREATE OR REPLACE TABLE result_sequential AS
SELECT b.*, 
       b.price0, 
       p1.price AS price1
FROM result_sequential b
ASOF LEFT JOIN prices p1 
    ON p1.id = b.id AND p1.epochNs <= (b.epochNs + 2000000)

To Reproduce

I've created a benchmark script that compares two approaches for executing multiple ASOF joins:

  1. Multiple ASOF joins in a single query
  2. Sequential execution of individual ASOF joins

The benchmark shows that memory usage for the single query approach grows significantly with the number of joins, while sequential joins maintain consistent memory usage.

Running ./run_benchmarks.sh yields:

DuckDB ASOF Join Performance Benchmark
======================================
DuckDB Version: v1.1.1

=============================================
BENCHMARK WITH 2 JOINS
=============================================
Running benchmark with 2 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
  Completed in 0.31 seconds
  Peak memory increase: 94.69 MB
Running sequential benchmark...
  Completed in 0.35 seconds
  Peak memory increase: 7.73 MB

Results match!
  Row-by-row comparison: Match
  Hash-based comparison: Match

Performance comparison:
  Runtime: Single query is 1.16x faster than sequential
  Memory: Single query uses 12.25x more memory than sequential

==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.1.1
Configuration: 2 joins, 100000 base rows, 500000 price rows
Random Seed: 42

Single Query Approach:
  Runtime: 0.31 seconds
  Peak Memory Increase: 94.69 MB

Sequential Joins Approach:
  Runtime: 0.35 seconds
  Peak Memory Increase: 7.73 MB

Comparison (Single vs Sequential):
  Runtime: Single query is 1.16x faster than sequential
  Memory: Single query uses 12.25x more memory than sequential

Results from both approaches MATCH.
  Row-by-row comparison successful
  Hash-based comparison successful

=============================================
BENCHMARK WITH 4 JOINS
=============================================
Running benchmark with 4 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
  Completed in 0.50 seconds
  Peak memory increase: 337.70 MB
Running sequential benchmark...
  Completed in 0.74 seconds
  Peak memory increase: 9.07 MB

Results match!
  Row-by-row comparison: Match
  Hash-based comparison: Match

Performance comparison:
  Runtime: Single query is 1.48x faster than sequential
  Memory: Single query uses 37.25x more memory than sequential

==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.1.1
Configuration: 4 joins, 100000 base rows, 500000 price rows
Random Seed: 42

Single Query Approach:
  Runtime: 0.50 seconds
  Peak Memory Increase: 337.70 MB

Sequential Joins Approach:
  Runtime: 0.74 seconds
  Peak Memory Increase: 9.07 MB

Comparison (Single vs Sequential):
  Runtime: Single query is 1.48x faster than sequential
  Memory: Single query uses 37.25x more memory than sequential

Results from both approaches MATCH.
  Row-by-row comparison successful
  Hash-based comparison successful

=============================================
BENCHMARK WITH 6 JOINS
=============================================
Running benchmark with 6 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
  Completed in 0.79 seconds
  Peak memory increase: 676.42 MB
Running sequential benchmark...
  Completed in 1.31 seconds
  Peak memory increase: 7.67 MB

Results match!
  Row-by-row comparison: Match
  Hash-based comparison: Match

Performance comparison:
  Runtime: Single query is 1.65x faster than sequential
  Memory: Single query uses 88.21x more memory than sequential

==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.1.1
Configuration: 6 joins, 100000 base rows, 500000 price rows
Random Seed: 42

Single Query Approach:
  Runtime: 0.79 seconds
  Peak Memory Increase: 676.42 MB

Sequential Joins Approach:
  Runtime: 1.31 seconds
  Peak Memory Increase: 7.67 MB

Comparison (Single vs Sequential):
  Runtime: Single query is 1.65x faster than sequential
  Memory: Single query uses 88.21x more memory than sequential

Results from both approaches MATCH.
  Row-by-row comparison successful
  Hash-based comparison successful

Benchmark complete. Full results saved in benchmark_results/

Note the growth in memory usage for the single query approach:

  • 2 joins: 94.69 MB
  • 4 joins: 337.70 MB
  • 6 joins: 676.42 MB

The sequential approach, by contrast, maintains consistent memory usage (~7-9 MB) regardless of join count.

This memory growth becomes particularly problematic with larger datasets, where I've observed the single-query approach spilling to disk, resulting in extreme performance degradation.
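
To make the spill-to-disk behavior easier to observe at this smaller scale, here is a minimal sketch using DuckDB's documented memory_limit and temp_directory settings (the 256MB cap and the /tmp/duckdb_spill path are just illustrative choices; the tables are the ones created by the benchmark script, benchmark_asof_joins.py, below):

import duckdb

# Sketch: cap DuckDB's memory so the single-query ASOF-join plan spills to disk sooner.
conn = duckdb.connect(':memory:')
conn.execute("SET memory_limit = '256MB'")                 # well below the workload's observed peak
conn.execute("SET temp_directory = '/tmp/duckdb_spill'")   # where spilled blocks are written

# ...load base_single and prices as in the benchmark script below, then run the
# multi-ASOF-join query; with the cap in place, the slowdown from spilling becomes
# visible even at the 100k/500k row counts used in these benchmarks.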

#!/usr/bin/env python3
"""
DuckDB ASOF Join Performance Benchmark

Measures and compares performance of multiple ASOF joins in a single query
versus sequential execution of individual ASOF joins.
"""
import argparse
import duckdb
import time
import pandas as pd
import random
import sys
import gc
import resource

def measure_peak_memory(func, *args, **kwargs):
    """
    Wrapper to measure true peak memory usage of a function using resource module.
    Returns both the function result and the peak memory usage.
    """
    # Force garbage collection before measurement
    gc.collect()
    
    # Get initial memory usage
    initial_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform != 'darwin':
        # On Linux, ru_maxrss is in KB, convert to MB
        initial_memory = initial_memory / 1024
    else:
        # On macOS, ru_maxrss is in bytes, convert to MB
        initial_memory = initial_memory / (1024 * 1024)
    
    # Run the target function
    start_time = time.time()
    result = func(*args, **kwargs)
    runtime = time.time() - start_time
    
    # Get peak memory usage
    peak_memory = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform != 'darwin':
        # On Linux, ru_maxrss is in KB, convert to MB
        peak_memory = peak_memory / 1024
    else:
        # On macOS, ru_maxrss is in bytes, convert to MB
        peak_memory = peak_memory / (1024 * 1024)
    
    # Calculate peak memory increase
    memory_increase = max(0.1, peak_memory - initial_memory)  # Ensure non-zero
    
    return result, runtime, memory_increase

def benchmark_single_query(conn, num_joins, time_offsets):
    """Run benchmark with multiple ASOF joins in a single query."""
    # Clean up previous tables
    conn.execute("DROP TABLE IF EXISTS result_single")
    conn.execute("DROP TABLE IF EXISTS base_single")
    
    # Create fresh copy of base table
    conn.execute("CREATE TABLE base_single AS SELECT * FROM base_original")
    
    # Build the query
    query = "CREATE TABLE result_single AS SELECT b.*"
    for i in range(num_joins):
        query += f", p{i}.price AS price{i}"
    query += " FROM base_single b"
    for i in range(num_joins):
        offset = time_offsets[i]
        query += f" ASOF LEFT JOIN prices p{i} ON p{i}.id = b.id AND p{i}.epochNs <= (b.epochNs + {offset})"
    
    # Execute query and measure memory usage
    def execute_query():
        conn.execute(query)
    
    _, runtime, peak_memory = measure_peak_memory(execute_query)
    
    return {
        'runtime': runtime,
        'memory': peak_memory
    }

def benchmark_sequential(conn, num_joins, time_offsets):
    """Run benchmark with sequential ASOF joins."""
    # Clean up previous table
    conn.execute("DROP TABLE IF EXISTS result_sequential")
    
    # Create fresh copy of base table
    conn.execute("CREATE TABLE result_sequential AS SELECT * FROM base_original")
    
    # Function to run the sequential joins
    def execute_sequential_joins():
        # Run each join sequentially
        for i in range(num_joins):
            offset = time_offsets[i]
            
            # Build query with proper column handling
            cols = "SELECT b.*"
            if i > 0:
                cols += ", " + ", ".join([f"b.price{j}" for j in range(i)])
            cols += f", p{i}.price AS price{i}"
            
            query = f"""
            CREATE OR REPLACE TABLE result_sequential AS
            {cols}
            FROM result_sequential b
            ASOF LEFT JOIN prices p{i} ON p{i}.id = b.id AND p{i}.epochNs <= (b.epochNs + {offset})
            """
            conn.execute(query)
    
    # Execute sequential joins and measure memory usage
    _, runtime, peak_memory = measure_peak_memory(execute_sequential_joins)
    
    return {
        'runtime': runtime,
        'memory': peak_memory
    }

def verify_results(conn, num_joins):
    """Verify that both approaches produce the same results using multiple methods."""
    try:
        # Method 1: Row-by-row comparison
        conditions = []
        for i in range(num_joins):
            conditions.append(f"a.price{i} IS NOT DISTINCT FROM b.price{i}")
        match_clause = " AND ".join(conditions)
        
        conn.execute(f"""
        SELECT 
            COUNT(*) AS total_rows,
            SUM(CASE WHEN {match_clause} THEN 1 ELSE 0 END) AS matching_rows
        FROM result_single a
        FULL JOIN result_sequential b ON a.row_id = b.row_id
        """)
        
        row_verification = conn.fetchone()
        row_match = row_verification[0] == row_verification[1]
        
        # Method 2: Hash-based comparison
        # Build a hash expression that handles NULL values consistently for both tables
        hash_columns = []
        for i in range(num_joins):
            # COALESCE converts NULL to -1 for consistent hashing
            hash_columns.append(f"COALESCE(price{i}, -1)")
        
        hash_expr = ", ".join(hash_columns)
        
        # Calculate hash sum for single query results
        conn.execute(f"""
        SELECT SUM(HASH({hash_expr})) AS hash_sum
        FROM result_single
        """)
        single_hash = conn.fetchone()[0]
        
        # Calculate hash sum for sequential query results
        conn.execute(f"""
        SELECT SUM(HASH({hash_expr})) AS hash_sum
        FROM result_sequential
        """)
        sequential_hash = conn.fetchone()[0]
        
        hash_match = single_hash == sequential_hash
        
        return {
            'row_comparison': {
                'total_rows': row_verification[0],
                'matching_rows': row_verification[1],
                'match': row_match
            },
            'hash_comparison': {
                'single_hash_sum': single_hash,
                'sequential_hash_sum': sequential_hash,
                'match': hash_match
            },
            'match': row_match and hash_match
        }
    except Exception as e:
        return {
            'error': str(e),
            'match': False
        }

def run_benchmark(num_base_rows=100000, num_price_rows=500000, num_ids=1000, 
                  num_joins=4, verbose=True, seed=None):
    """Run the benchmark with specified parameters."""
    if verbose:
        print(f"Running benchmark with {num_joins} joins, {num_base_rows} base rows, "
              f"{num_price_rows} price rows, {num_ids} IDs")
        if seed is not None:
            print(f"Using random seed: {seed}")
    
    # Set random seed if specified
    if seed is not None:
        random.seed(seed)
    
    # Create database connection
    conn = duckdb.connect(':memory:')
    conn.execute("PRAGMA disable_progress_bar")
    
    # Time offsets for joins (1ms, 2ms, etc.)
    time_offsets = [(i+1) * 1000000 for i in range(num_joins)]
    
    # Generate test data
    if verbose:
        print("Generating test data...")
    
    # Create IDs
    ids = [f"ID_{i}" for i in range(num_ids)]
    
    # Generate base data
    base_data = [{'row_id': i+1, 
                  'id': random.choice(ids), 
                  'epochNs': random.randint(0, 86400000000000)} 
                 for i in range(num_base_rows)]
    base_df = pd.DataFrame(base_data)
    
    # Generate price data
    price_data = [{'row_id': i+1, 
                   'id': random.choice(ids), 
                   'epochNs': random.randint(0, 86400000000000),
                   'price': random.random() * 1000} 
                  for i in range(num_price_rows)]
    price_df = pd.DataFrame(price_data)
    
    # Load data into DuckDB
    conn.execute("CREATE TABLE base AS SELECT * FROM base_df ORDER BY id, epochNs")
    conn.execute("CREATE TABLE prices AS SELECT * FROM price_df ORDER BY id, epochNs")
    conn.execute("CREATE TABLE base_original AS SELECT * FROM base")
    
    # Initialize results
    results = {
        'parameters': {
            'base_rows': num_base_rows,
            'price_rows': num_price_rows,
            'ids': num_ids,
            'joins': num_joins,
            'seed': seed
        },
        'duckdb_version': conn.execute("SELECT version()").fetchone()[0]
    }
    
    if verbose:
        print("Running single query benchmark...")
    try:
        single_results = benchmark_single_query(conn, num_joins, time_offsets)
        results['single'] = {
            'status': 'success',
            'runtime': single_results['runtime'],
            'memory': single_results['memory']
        }
        if verbose:
            print(f"  Completed in {single_results['runtime']:.2f} seconds")
            print(f"  Peak memory increase: {single_results['memory']:.2f} MB")
    except Exception as e:
        results['single'] = {
            'status': 'failed',
            'error': str(e)
        }
        if verbose:
            print(f"  Error: {str(e)}")
    
    if verbose:
        print("Running sequential benchmark...")
    try:
        sequential_results = benchmark_sequential(conn, num_joins, time_offsets)
        results['sequential'] = {
            'status': 'success',
            'runtime': sequential_results['runtime'],
            'memory': sequential_results['memory']
        }
        if verbose:
            print(f"  Completed in {sequential_results['runtime']:.2f} seconds")
            print(f"  Peak memory increase: {sequential_results['memory']:.2f} MB")
    except Exception as e:
        results['sequential'] = {
            'status': 'failed',
            'error': str(e)
        }
        if verbose:
            print(f"  Error: {str(e)}")
    
    # Verify results
    if results['single']['status'] == 'success' and results['sequential']['status'] == 'success':
        verification = verify_results(conn, num_joins)
        results['verification'] = verification
        
        # Calculate ratios
        if results['sequential']['runtime'] > 0 and results['single']['runtime'] > 0:
            # Single query approach is X times faster than sequential approach
            speedup = results['sequential']['runtime'] / results['single']['runtime']
            # Single query approach uses X times more memory than sequential approach
            memory_ratio = results['single']['memory'] / results['sequential']['memory']
            
            results['comparison'] = {
                'speedup': speedup,
                'memory_ratio': memory_ratio
            }
            
            if verbose:
                print(f"\nResults {'match' if verification.get('match', False) else 'DO NOT match'}!")
                if 'hash_comparison' in verification:
                    hash_match = verification['hash_comparison'].get('match', False)
                    print(f"  Row-by-row comparison: {'Match' if verification['row_comparison']['match'] else 'Differ'}")
                    print(f"  Hash-based comparison: {'Match' if hash_match else 'Differ'}")
                print(f"\nPerformance comparison:")
                print(f"  Runtime: Single query is {speedup:.2f}x faster than sequential")
                print(f"  Memory: Single query uses {memory_ratio:.2f}x more memory than sequential")
    
    return results

def main():
    parser = argparse.ArgumentParser(description="DuckDB ASOF Join Performance Benchmark")
    parser.add_argument("--base-rows", type=int, default=100000, help="Number of rows in base table")
    parser.add_argument("--price-rows", type=int, default=500000, help="Number of rows in prices table")
    parser.add_argument("--ids", type=int, default=1000, help="Number of unique IDs")
    parser.add_argument("--joins", type=int, default=4, help="Number of ASOF joins to perform")
    parser.add_argument("--seed", type=int, default=None, help="Random seed for reproducibility")
    
    args = parser.parse_args()
    
    results = run_benchmark(
        num_base_rows=args.base_rows,
        num_price_rows=args.price_rows,
        num_ids=args.ids,
        num_joins=args.joins,
        seed=args.seed
    )
    
    print("\n" + "="*50)
    print("BENCHMARK SUMMARY")
    print("="*50)
    print(f"DuckDB Version: {results['duckdb_version']}")
    print(f"Configuration: {results['parameters']['joins']} joins, "
          f"{results['parameters']['base_rows']} base rows, "
          f"{results['parameters']['price_rows']} price rows")
    if results['parameters']['seed'] is not None:
        print(f"Random Seed: {results['parameters']['seed']}")
    
    if 'single' in results:
        print("\nSingle Query Approach:")
        if results['single']['status'] == 'success':
            print(f"  Runtime: {results['single']['runtime']:.2f} seconds")
            print(f"  Peak Memory Increase: {results['single']['memory']:.2f} MB")
        else:
            print(f"  Failed: {results['single'].get('error', 'Unknown error')}")
    
    if 'sequential' in results:
        print("\nSequential Joins Approach:")
        if results['sequential']['status'] == 'success':
            print(f"  Runtime: {results['sequential']['runtime']:.2f} seconds")
            print(f"  Peak Memory Increase: {results['sequential']['memory']:.2f} MB")
        else:
            print(f"  Failed: {results['sequential'].get('error', 'Unknown error')}")
    
    if 'comparison' in results:
        print("\nComparison (Single vs Sequential):")
        print(f"  Runtime: Single query is {results['comparison']['speedup']:.2f}x faster than sequential")
        print(f"  Memory: Single query uses {results['comparison']['memory_ratio']:.2f}x more memory than sequential")
        
        if 'verification' in results:
            if results['verification'].get('match', False):
                print("\nResults from both approaches MATCH.")
                if 'hash_comparison' in results['verification']:
                    print("  Row-by-row comparison successful")
                    print("  Hash-based comparison successful")
            else:
                print("\nWARNING: Results from both approaches DO NOT match!")
                if 'row_comparison' in results['verification'] and 'hash_comparison' in results['verification']:
                    print(f"  Row comparison: {'Match' if results['verification']['row_comparison']['match'] else 'Differ'}")
                    print(f"  Hash comparison: {'Match' if results['verification']['hash_comparison']['match'] else 'Differ'}")
    

    if (results.get('single', {}).get('status') == 'success' and 
        results.get('sequential', {}).get('status') == 'success'):
        return 0
    return 1

if __name__ == "__main__":
    sys.exit(main())

#!/bin/bash
# DuckDB ASOF Join Performance Benchmark Runner (run_benchmarks.sh)
# With reproducible random seed

set -e

# Check if Python script exists
BENCHMARK_SCRIPT="benchmark_asof_joins.py"
if [ ! -f "$BENCHMARK_SCRIPT" ]; then
    echo "Error: Cannot find benchmark script: $BENCHMARK_SCRIPT"
    exit 1
fi

# Make script executable
chmod +x "$BENCHMARK_SCRIPT"

# Create results directory
mkdir -p benchmark_results

echo "DuckDB ASOF Join Performance Benchmark"
echo "======================================"

# Get DuckDB version
VERSION=$(python -c "import duckdb; print(duckdb.connect(':memory:').execute('SELECT version()').fetchone()[0])")
echo "DuckDB Version: $VERSION"
echo

# Define benchmark configurations
JOIN_COUNTS=(2 4 6)
BASE_ROWS=100000
PRICE_ROWS=500000
IDS=1000
SEED=42  # Fixed seed for reproducibility

# Run benchmarks for each join count
for JOINS in "${JOIN_COUNTS[@]}"; do
    echo "============================================="
    echo "BENCHMARK WITH $JOINS JOINS"
    echo "============================================="
    
    # Run benchmark with text output for user display
    ./$BENCHMARK_SCRIPT --joins $JOINS --base-rows $BASE_ROWS --price-rows $PRICE_ROWS --ids $IDS --seed $SEED
    
    echo
done

echo "Benchmark complete. Full results saved in benchmark_results/"

OS:

Linux

DuckDB Version:

1.1.1

DuckDB Client:

Python

Hardware:

No response

Full Name:

Pavel Khokhlov

Affiliation:

personal

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@szarnyasg
Collaborator

Hi @pkhokhlov, thanks for opening this issue and providing a detailed reproducer. Did you try DuckDB 1.2.1?

Here are my results with 1.2.1, running in a dockerized environment (Ubuntu 24.04 on macOS, Apple Silicon):

DuckDB ASOF Join Performance Benchmark
======================================
DuckDB Version: v1.2.1

=============================================
BENCHMARK WITH 2 JOINS
=============================================
Running benchmark with 2 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
  Completed in 0.13 seconds
  Peak memory increase: 131.55 MB
Running sequential benchmark...
  Completed in 0.15 seconds
  Peak memory increase: 5.88 MB

Results match!
  Row-by-row comparison: Match
  Hash-based comparison: Match

Performance comparison:
  Runtime: Single query is 1.13x faster than sequential
  Memory: Single query uses 22.39x more memory than sequential

==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.2.1
Configuration: 2 joins, 100000 base rows, 500000 price rows
Random Seed: 42

Single Query Approach:
  Runtime: 0.13 seconds
  Peak Memory Increase: 131.55 MB

Sequential Joins Approach:
  Runtime: 0.15 seconds
  Peak Memory Increase: 5.88 MB

Comparison (Single vs Sequential):
  Runtime: Single query is 1.13x faster than sequential
  Memory: Single query uses 22.39x more memory than sequential

Results from both approaches MATCH.
  Row-by-row comparison successful
  Hash-based comparison successful

=============================================
BENCHMARK WITH 4 JOINS
=============================================
Running benchmark with 4 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
  Completed in 0.21 seconds
  Peak memory increase: 462.39 MB
Running sequential benchmark...
  Completed in 0.33 seconds
  Peak memory increase: 2.62 MB

Results match!
  Row-by-row comparison: Match
  Hash-based comparison: Match

Performance comparison:
  Runtime: Single query is 1.61x faster than sequential
  Memory: Single query uses 176.15x more memory than sequential

==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.2.1
Configuration: 4 joins, 100000 base rows, 500000 price rows
Random Seed: 42

Single Query Approach:
  Runtime: 0.21 seconds
  Peak Memory Increase: 462.39 MB

Sequential Joins Approach:
  Runtime: 0.33 seconds
  Peak Memory Increase: 2.62 MB

Comparison (Single vs Sequential):
  Runtime: Single query is 1.61x faster than sequential
  Memory: Single query uses 176.15x more memory than sequential

Results from both approaches MATCH.
  Row-by-row comparison successful
  Hash-based comparison successful

=============================================
BENCHMARK WITH 6 JOINS
=============================================
Running benchmark with 6 joins, 100000 base rows, 500000 price rows, 1000 IDs
Using random seed: 42
Generating test data...
Running single query benchmark...
  Completed in 0.27 seconds
  Peak memory increase: 819.13 MB
Running sequential benchmark...
  Completed in 0.58 seconds
  Peak memory increase: 37.89 MB

Results match!
  Row-by-row comparison: Match
  Hash-based comparison: Match

Performance comparison:
  Runtime: Single query is 2.18x faster than sequential
  Memory: Single query uses 21.62x more memory than sequential

==================================================
BENCHMARK SUMMARY
==================================================
DuckDB Version: v1.2.1
Configuration: 6 joins, 100000 base rows, 500000 price rows
Random Seed: 42

Single Query Approach:
  Runtime: 0.27 seconds
  Peak Memory Increase: 819.13 MB

Sequential Joins Approach:
  Runtime: 0.58 seconds
  Peak Memory Increase: 37.89 MB

Comparison (Single vs Sequential):
  Runtime: Single query is 2.18x faster than sequential
  Memory: Single query uses 21.62x more memory than sequential

Results from both approaches MATCH.
  Row-by-row comparison successful
  Hash-based comparison successful

Benchmark complete. Full results saved in benchmark_results/

@pkhokhlov
Author

@szarnyasg yes, I see the same behavior, which you also reproduced above:

  • 2 joins: single query uses 22x more memory
  • 4 joins: 176x
  • 6 joins: 21x

Contributor
hawkfish commented Apr 8, 2025

Thanks for the report! This was something dumb (not releasing memory when we are done with it). Very similar to what we do in the window operator. With the fix, I measure peak memory usage with Instruments at about 100MB.

Mytherin added a commit that referenced this issue Apr 8, 2025
* Free memory right after we are done with it

fixes: #17001
fixes: duckdblabs/duckdb-internal#4586
krlmlr added commits to duckdb/duckdb-r that referenced this issue on May 15, 16, and 17, 2025