Description
What happens?
With DuckDB 1.3.0 we have issues when reading multiple Parquet files, caused by the automatic typing of all-null columns as JSON: some Parquet files end up with JSON columns simply because a column contains only null values. Reading those files together then produces the following error:
the column "memberDn" has type VARCHAR, but we are trying to read it as type JSON.
This can happen when reading multiple Parquet files. The schema information is taken from the first Parquet file by default.
In DuckDB 1.2.2 the same query results in a JSON column without an error.
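The inferred schemas can be inspected with DESCRIBE (using the test files from the reproduction below). This is a sketch assuming default JSON reader settings; it shows how the all-null memberDn column is typed differently per file:

describe select * from 'test1.json';
-- memberDn: JSON (every value is null, so the reader falls back to JSON)
describe select * from 'test2.json';
-- memberDn: VARCHAR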
To Reproduce
test1.json
[
  {
    "groupDn": "TEST 1",
    "memberDn": null
  },
  {
    "groupDn": "TEST 2",
    "memberDn": null
  },
  {
    "groupDn": "TEST 3",
    "memberDn": null
  }
]
test2.json
[
  {
    "groupDn": "TEST 1",
    "memberDn": "a"
  },
  {
    "groupDn": "TEST 2",
    "memberDn": "b"
  },
  {
    "groupDn": "TEST 3",
    "memberDn": "c"
  }
]
copy (select * from 'test1.json') to 'test1.parquet';
copy (select * from 'test2.json') to 'test2.parquet';
select * from read_parquet('test*.parquet');
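A possible workaround (a sketch, not verified against the 1.3.0 builds above) is to cast the all-null column to VARCHAR before writing the first Parquet file, so both files carry the same schema:

-- force memberDn to VARCHAR instead of the inferred JSON type
copy (select groupDn, cast(memberDn as varchar) as memberDn from 'test1.json') to 'test1.parquet';
copy (select * from 'test2.json') to 'test2.parquet';
select * from read_parquet('test*.parquet');

Alternatively, read_parquet accepts union_by_name = true to unify schemas by column name across files; whether that resolves this particular JSON/VARCHAR conflict is untested.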
OS:
Ubuntu 22.04
DuckDB Version:
1.3.0 preview builds from the last couple of weeks (including the latest)
DuckDB Client:
Python and CLI
Hardware:
No response
Full Name:
Daniel Gut
Affiliation:
Aveniq
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential
Did you include all code required to reproduce the issue?
Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
Yes, I have