
Performance Regression in LEFT JOIN with Timestamp Calculations (v1.1.1 → v1.2.0) #16552


Closed
pkhokhlov opened this issue Mar 7, 2025 · 1 comment · Fixed by #16943

@pkhokhlov

What happens?

A severe performance regression (~22x slower) occurs in DuckDB 1.2.0 compared to 1.1.1 when executing a LEFT JOIN whose ON clause includes a timestamp-difference calculation. The issue appears to be related to the order in which the join conditions are evaluated: 1.2.0 evaluates the expensive timestamp expression first, rather than the cheaper equality conditions.

To Reproduce

Minimal Reproducible Example

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import duckdb
import time

np.random.seed(42)

def generate_test_data(num_records=15000):
    base_time = datetime(2023, 1, 1, 10, 0, 0)
    
    table1_data = []
    for i in range(num_records):
        id_val = f"ID{i:06d}"
        val = np.random.randint(10, 1000)
        dt = base_time + timedelta(seconds=i*5)
        
        table1_data.append({
            'id': id_val,
            'val': val,
            'dt': dt
        })
    
    table1_df = pd.DataFrame(table1_data)
    
    # Create table2 dataframe with noise in timestamps
    table2_data = []
    for i, row in enumerate(table1_data):
        # 80% of records will have a matching record, 20% won't (to simulate LEFT JOIN)
        if np.random.random() < 0.8:
            # Add Gaussian noise to the timestamp (std dev of 30ms)
            noise_ms = np.random.normal(0, 30)
            dt_with_noise = row['dt'] + timedelta(milliseconds=noise_ms)
            
            table2_data.append({
                'id': row['id'],
                'val': row['val'],
                'dt': dt_with_noise
            })
    
    table2_df = pd.DataFrame(table2_data)
    
    return table1_df, table2_df

table1_df, table2_df = generate_test_data()

print("Table1 Data Sample:")
print(table1_df.head())
print("\nTable2 Data Sample:")
print(table2_df.head())

def test_query(conn):
    start_time = time.time()
    
    result = conn.execute("""
    WITH base AS(
        SELECT t2.dt AS t2_dt, t1.dt AS t1_dt, t1.id, t1.val, 
               ABS(epoch_ms(t2.dt - t1.dt)) AS ms_diff
        FROM table1 t1
        LEFT JOIN table2 t2 ON t2.id = t1.id
                            AND t2.val = t1.val
                            AND ABS(epoch_ms(t2.dt - t1.dt)) < 100
    )
    SELECT *
    FROM base
    WHERE t2_dt IS NULL -- checks for null records from table2
    """).fetchall()
    
    end_time = time.time()
    print(f"Query execution time: {end_time - start_time:.4f} seconds")
    print(f"Number of rows returned: {len(result)}")
    
    explain = conn.execute("""
    EXPLAIN ANALYZE
    WITH base AS(
        SELECT t2.dt AS t2_dt, t1.dt AS t1_dt, t1.id, t1.val, 
               ABS(epoch_ms(t2.dt - t1.dt)) AS ms_diff
        FROM table1 t1
        LEFT JOIN table2 t2 ON t2.id = t1.id
                            AND t2.val = t1.val
                            AND ABS(epoch_ms(t2.dt - t1.dt)) < 100
    )
    SELECT *
    FROM base
    WHERE t2_dt IS NULL -- checks for null records from table2
    """).fetchall()
    
    return result, explain

def run_test():
    # Use in-memory database
    conn = duckdb.connect(':memory:')
    
    db_version = conn.execute("SELECT version()").fetchone()[0]
    print(f"DuckDB version: {db_version}")
    
    conn.execute("CREATE TABLE table1 AS SELECT * FROM table1_df")
    conn.execute("CREATE TABLE table2 AS SELECT * FROM table2_df")
    
    results, explain = test_query(conn)
    
    conn.close()
    
    return results, explain

results, explain = run_test()

Output in DuckDB 1.1.1

Table1 Data Sample:
        id  val                    dt
0  ID000000  112 2023-01-01 10:00:00
1  ID000001  445 2023-01-01 10:00:05
2  ID000002  870 2023-01-01 10:00:10
3  ID000003  280 2023-01-01 10:00:15
4  ID000004  116 2023-01-01 10:00:20

Table2 Data Sample:
        id  val                    dt
0  ID000002  870 2023-01-01 10:00:10.056365
1  ID000004  116 2023-01-01 10:00:20.021602
2  ID000005   81 2023-01-01 10:00:25.020553
3  ID000008  624 2023-01-01 10:00:40.003636
4  ID000009  131 2023-01-01 10:00:45.013059

DuckDB version: v1.1.1
Query execution time: 0.2993 seconds
Number of rows returned: 3001

Explain output (partial):
((val = val) AND ((id = id) AND (abs(epoch_ms((CAST(dt AS TIMESTAMP) - CAST(dt AS TIMESTAMP)))) < 100)))

Output in DuckDB 1.2.0

Table1 Data Sample:
        id  val                    dt
0  ID000000  112 2023-01-01 10:00:00
1  ID000001  445 2023-01-01 10:00:05
2  ID000002  870 2023-01-01 10:00:10
3  ID000003  280 2023-01-01 10:00:15
4  ID000004  116 2023-01-01 10:00:20

Table2 Data Sample:
        id  val                    dt
0  ID000002  870 2023-01-01 10:00:10.056365
1  ID000004  116 2023-01-01 10:00:20.021602
2  ID000005   81 2023-01-01 10:00:25.020553
3  ID000008  624 2023-01-01 10:00:40.003636
4  ID000009  131 2023-01-01 10:00:45.013059

DuckDB version: v1.2.0
Query execution time: 6.6306 seconds
Number of rows returned: 3001

Explain output (partial):
(((abs(epoch_ms((CAST(dt AS TIMESTAMP) - CAST(dt AS TIMESTAMP)))) < 100) AND (id = id)) AND (val = val))

Root Cause

The join condition evaluation order changed between versions:

  • v1.1.1 evaluates simpler conditions first (val equality, then id equality)
  • v1.2.0 evaluates the expensive timestamp calculation first for all potential row combinations

This change appears to bypass a crucial optimization where simpler conditions can filter out non-matching rows before performing expensive calculations.
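Possible workaround (a sketch only, not benchmarked against 1.2.0): since the query uses the LEFT JOIN ... WHERE t2_dt IS NULL anti-join pattern and table2.dt is never NULL in this data, it can be rewritten with NOT EXISTS so the id/val equality predicates act as the correlated lookup keys. The unmatched_rows helper name below is illustrative only.

def unmatched_rows(conn):
    # Returns table1 rows with no table2 row matching on id and val
    # within 100 ms; equivalent to the original anti-join here because
    # table2.dt contains no NULLs.
    return conn.execute("""
    SELECT t1.id, t1.val, t1.dt
    FROM table1 t1
    WHERE NOT EXISTS (
        SELECT 1
        FROM table2 t2
        WHERE t2.id = t1.id
          AND t2.val = t1.val
          AND ABS(epoch_ms(t2.dt - t1.dt)) < 100
    )
    """).fetchall()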

Performance Impact

  • DuckDB v1.1.1: ~0.30 seconds
  • DuckDB v1.2.0: ~6.63 seconds

OS:

Linux

DuckDB Version:

1.1.1 and 1.2.0

DuckDB Client:

Python

Hardware:

same hardware used for both tests

Full Name:

Pavel Khokhlov

Affiliation:

personal

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@szarnyasg
Collaborator

Thanks, I could reproduce this:

DuckDB version: v1.1.3
Query execution time: 0.1374 seconds
Number of rows returned: 3001
DuckDB version: v1.2.0
Query execution time: 1.9658 seconds
Number of rows returned: 3001

krlmlr added commits to duckdb/duckdb-r that referenced this issue (May 18–19, 2025)