
Performance Regression in LEFT JOIN with Timestamp Calculations (v1.1.1 → v1.2.0) #16552


Closed
pkhokhlov opened this issue Mar 7, 2025 · 1 comment · Fixed by #16943

@pkhokhlov

What happens?

A severe performance regression (~22x slower) occurs in DuckDB 1.2.0 compared to 1.1.1 when executing a LEFT JOIN whose ON clause includes a timestamp-difference calculation. The issue appears to be related to the order in which the join conditions are evaluated: 1.2.0 evaluates the expensive timestamp expression first, rather than the cheaper equality conditions.

To Reproduce

Minimal Reproducible Example

import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import duckdb
import time

np.random.seed(42)

def generate_test_data(num_records=15000):
    base_time = datetime(2023, 1, 1, 10, 0, 0)
    
    table1_data = []
    for i in range(num_records):
        id_val = f"ID{i:06d}"
        val = np.random.randint(10, 1000)
        dt = base_time + timedelta(seconds=i*5)
        
        table1_data.append({
            'id': id_val,
            'val': val,
            'dt': dt
        })
    
    table1_df = pd.DataFrame(table1_data)
    
    # Create table2 dataframe with noise in timestamps
    table2_data = []
    for i, row in enumerate(table1_data):
        # 80% of records will have a matching record, 20% won't (to simulate LEFT JOIN)
        if np.random.random() < 0.8:
            # Add Gaussian noise to the timestamp (std dev of 30ms)
            noise_ms = np.random.normal(0, 30)
            dt_with_noise = row['dt'] + timedelta(milliseconds=noise_ms)
            
            table2_data.append({
                'id': row['id'],
                'val': row['val'],
                'dt': dt_with_noise
            })
    
    table2_df = pd.DataFrame(table2_data)
    
    return table1_df, table2_df

table1_df, table2_df = generate_test_data()

print("Table1 Data Sample:")
print(table1_df.head())
print("\nTable2 Data Sample:")
print(table2_df.head())

def test_query(conn):
    start_time = time.time()
    
    result = conn.execute("""
    WITH base AS(
        SELECT t2.dt AS t2_dt, t1.dt AS t1_dt, t1.id, t1.val, 
               ABS(epoch_ms(t2.dt - t1.dt)) AS ms_diff
        FROM table1 t1
        LEFT JOIN table2 t2 ON t2.id = t1.id
                            AND t2.val = t1.val
                            AND ABS(epoch_ms(t2.dt - t1.dt)) < 100
    )
    SELECT *
    FROM base
    WHERE t2_dt IS NULL -- checks for null records from table2
    """).fetchall()
    
    end_time = time.time()
    print(f"Query execution time: {end_time - start_time:.4f} seconds")
    print(f"Number of rows returned: {len(result)}")
    
    explain = conn.execute("""
    EXPLAIN ANALYZE
    WITH base AS(
        SELECT t2.dt AS t2_dt, t1.dt AS t1_dt, t1.id, t1.val, 
               ABS(epoch_ms(t2.dt - t1.dt)) AS ms_diff
        FROM table1 t1
        LEFT JOIN table2 t2 ON t2.id = t1.id
                            AND t2.val = t1.val
                            AND ABS(epoch_ms(t2.dt - t1.dt)) < 100
    )
    SELECT *
    FROM base
    WHERE t2_dt IS NULL -- checks for null records from table2
    """).fetchall()
    
    return result, explain

def run_test():
    # Use in-memory database
    conn = duckdb.connect(':memory:')
    
    db_version = conn.execute("SELECT version()").fetchone()[0]
    print(f"DuckDB version: {db_version}")
    
    conn.execute("CREATE TABLE table1 AS SELECT * FROM table1_df")
    conn.execute("CREATE TABLE table2 AS SELECT * FROM table2_df")
    
    results, explain = test_query(conn)
    
    conn.close()
    
    return results, explain

results, explain = run_test()

Output in DuckDB 1.1.1

Table1 Data Sample:
        id  val                    dt
0  ID000000  112 2023-01-01 10:00:00
1  ID000001  445 2023-01-01 10:00:05
2  ID000002  870 2023-01-01 10:00:10
3  ID000003  280 2023-01-01 10:00:15
4  ID000004  116 2023-01-01 10:00:20

Table2 Data Sample:
        id  val                    dt
0  ID000002  870 2023-01-01 10:00:10.056365
1  ID000004  116 2023-01-01 10:00:20.021602
2  ID000005   81 2023-01-01 10:00:25.020553
3  ID000008  624 2023-01-01 10:00:40.003636
4  ID000009  131 2023-01-01 10:00:45.013059

DuckDB version: v1.1.1
Query execution time: 0.2993 seconds
Number of rows returned: 3001

Explain output (partial):
((val = val) AND ((id = id) AND (abs(epoch_ms((CAST(dt AS TIMESTAMP) - CAST(dt AS TIMESTAMP)))) < 100)))

Output in DuckDB 1.2.0

Table1 Data Sample:
        id  val                    dt
0  ID000000  112 2023-01-01 10:00:00
1  ID000001  445 2023-01-01 10:00:05
2  ID000002  870 2023-01-01 10:00:10
3  ID000003  280 2023-01-01 10:00:15
4  ID000004  116 2023-01-01 10:00:20

Table2 Data Sample:
        id  val                    dt
0  ID000002  870 2023-01-01 10:00:10.056365
1  ID000004  116 2023-01-01 10:00:20.021602
2  ID000005   81 2023-01-01 10:00:25.020553
3  ID000008  624 2023-01-01 10:00:40.003636
4  ID000009  131 2023-01-01 10:00:45.013059

DuckDB version: v1.2.0
Query execution time: 6.6306 seconds
Number of rows returned: 3001

Explain output (partial):
(((abs(epoch_ms((CAST(dt AS TIMESTAMP) - CAST(dt AS TIMESTAMP)))) < 100) AND (id = id)) AND (val = val))

Root Cause

The join condition evaluation order changed between versions:

  • v1.1.1 evaluates simpler conditions first (val equality, then id equality)
  • v1.2.0 evaluates the expensive timestamp calculation first for all potential row combinations

This change appears to bypass a crucial optimization where simpler conditions can filter out non-matching rows before performing expensive calculations.
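Possible workaround (a sketch only, not benchmarked against 1.2.0): since the query uses the LEFT JOIN ... WHERE t2_dt IS NULL anti-join pattern and table2.dt is never NULL in this data, it can be rewritten with NOT EXISTS so the id/val equality predicates act as the correlated lookup keys. The unmatched_rows helper name below is illustrative only.

def unmatched_rows(conn):
    # Returns table1 rows with no table2 row matching on id and val
    # within 100 ms; equivalent to the original anti-join here because
    # table2.dt contains no NULLs.
    return conn.execute("""
    SELECT t1.id, t1.val, t1.dt
    FROM table1 t1
    WHERE NOT EXISTS (
        SELECT 1
        FROM table2 t2
        WHERE t2.id = t1.id
          AND t2.val = t1.val
          AND ABS(epoch_ms(t2.dt - t1.dt)) < 100
    )
    """).fetchall()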

Performance Impact

  • DuckDB v1.1.1: ~0.30 seconds
  • DuckDB v1.2.0: ~6.63 seconds

OS:

Linux

DuckDB Version:

1.1.1 and 1.2.0

DuckDB Client:

Python

Hardware:

same hardware used for both tests

Full Name:

Pavel Khokhlov

Affiliation:

personal

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
@szarnyasg
Collaborator

Thanks, I could reproduce this:

DuckDB version: v1.1.3
Query execution time: 0.1374 seconds
Number of rows returned: 3001
DuckDB version: v1.2.0
Query execution time: 1.9658 seconds
Number of rows returned: 3001

krlmlr added commits to duckdb/duckdb-r that referenced this issue (May 18–19, 2025)