Description
What happens?
With DuckDB 1.3.0 we have issues when reading multiple Parquet files, caused by the automatic typing of all-null columns as JSON: some Parquet files end up with JSON columns simply because a column contains only null values. Reading those files together then produces the following error:
the column "memberDn" has type VARCHAR, but we are trying to read it as type JSON.
This can happen when reading multiple Parquet files. The schema information is taken from the first Parquet file by default.
In DuckDB 1.2.2 the same query results in a JSON column without an error.
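The inferred schemas can be inspected with DESCRIBE (using the test files from the reproduction below). This is a sketch assuming default JSON reader settings; it shows how the all-null memberDn column is typed differently per file:

describe select * from 'test1.json';
-- memberDn: JSON (every value is null, so the reader falls back to JSON)
describe select * from 'test2.json';
-- memberDn: VARCHAR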
To Reproduce
test1.json
[
  {
    "groupDn": "TEST 1",
    "memberDn": null
  },
  {
    "groupDn": "TEST 2",
    "memberDn": null
  },
  {
    "groupDn": "TEST 3",
    "memberDn": null
  }
]
test2.json
[
  {
    "groupDn": "TEST 1",
    "memberDn": "a"
  },
  {
    "groupDn": "TEST 2",
    "memberDn": "b"
  },
  {
    "groupDn": "TEST 3",
    "memberDn": "c"
  }
]
copy (select * from 'test1.json') to 'test1.parquet';
copy (select * from 'test2.json') to 'test2.parquet';
select * from read_parquet('test*.parquet');
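A possible workaround (a sketch, not verified against the 1.3.0 builds above) is to cast the all-null column to VARCHAR before writing the first Parquet file, so both files carry the same schema:

-- force memberDn to VARCHAR instead of the inferred JSON type
copy (select groupDn, cast(memberDn as varchar) as memberDn from 'test1.json') to 'test1.parquet';
copy (select * from 'test2.json') to 'test2.parquet';
select * from read_parquet('test*.parquet');

Alternatively, read_parquet accepts union_by_name = true to unify schemas by column name across files; whether that resolves this particular JSON/VARCHAR conflict is untested.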
OS:
Ubuntu 22.04
DuckDB Version:
1.3.0 preview builds from the last couple of weeks (including the latest)
DuckDB Client:
Python and CLI
Hardware:
No response
Full Name:
Daniel Gut
Affiliation:
Aveniq
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a nightly build
Did you include all relevant data sets for reproducing the issue?
No - I cannot share the data sets because they are confidential
Did you include all code required to reproduce the issue?
Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
Yes, I have