dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1745090202.306819 with exception:
<class 'pyarrow.lib.ArrowTypeError'>
("Expected bytes, got a 'int' object", 'Conversion failed for column col with type object')
carlpaten changed the title from "Heterogeneous columns in Pandas data frames cause crashes" to "Heterogeneous columns in Pandas data frames fail to extract" on Apr 19, 2025
import io

import pandas as pd


def pandas_normalize_types(df: pd.DataFrame) -> pd.DataFrame:
    """Given a Pandas dataframe with heterogeneous columns, like those produced by pd.read_excel(), coerce each column's dtype and values to the same type."""
    # Hack: round-trip through CSV to piggyback on pandas' type inference
    tmp = io.StringIO()
    df.to_csv(tmp, index=False)
    tmp.seek(0)
    return pd.read_csv(tmp).convert_dtypes()
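The CSV round trip re-infers every column, including ones that were already homogeneous. A narrower alternative, as a minimal sketch: cast only the `object` columns that actually hold more than one Python type to pandas' `string` dtype. The helper name `stringify_mixed_columns` is hypothetical, and the sketch assumes cells are scalars (as in a typical spreadsheet), not lists or dicts.

```python
import pandas as pd


def stringify_mixed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which object columns holding mixed Python types are cast to strings."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            # Already a concrete dtype (int64, float64, string, ...): leave it alone.
            continue
        # Collect the Python types of the non-null values in this column.
        kinds = {type(value) for value in col.dropna()}
        if len(kinds) > 1:
            # Keep missing values missing; stringify everything else.
            as_text = col.map(lambda value: value if pd.isna(value) else str(value))
            out[name] = as_text.astype("string")
    return out
```

With a column such as `["a", 1]`, this should yield the strings `"a"` and `"1"`, which matches the expected behavior described in the issue. Columns that are already homogeneous are left untouched.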
Thanks for the ticket. If you want the normalizer to run, you will have to yield dictionaries. If you yield tabular data (dataframes or Arrow tables), we expect the columns to be homogeneous, and dlt goes into a kind of optimized mode where the normalizer is mostly skipped, except for possibly adding internal columns and normalizing column names. In your example you are creating a dataframe with an "object" column type, which allows mixed-type columns; that will not work with dlt. We should probably improve the error message, but I strongly doubt that we will add a full normalization step for dataframes, as it defeats the purpose. Is there any specific reason you can't just yield dicts?
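For reference, yielding dicts from an existing dataframe can be as small as the sketch below. The generator name `iter_rows` is hypothetical; only `DataFrame.to_dict(orient="records")` is pandas API.

```python
from typing import Any, Dict, Iterator

import pandas as pd


def iter_rows(df: pd.DataFrame) -> Iterator[Dict[str, Any]]:
    """Yield each row of the dataframe as a plain dictionary keyed by column name."""
    for record in df.to_dict(orient="records"):
        yield record
```

A generator like this could then be handed to the pipeline in place of the dataframe, for example `pipeline.run(iter_rows(df), table_name="sheet")`, where the table name is a placeholder. The trade-off is that rows go through the regular normalizer instead of the optimized tabular path.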
dlt version
1.9.0
Describe the problem
This comes up for example when loading Excel spreadsheets that have mixed number/string columns.
Expected behavior
Integer values should be coerced to string as part of normalization
Steps to reproduce
Operating system
Linux, macOS
Runtime environment
Local
Python version
3.12
dlt data source
N/A
dlt destination
No response
Other deployment details
Pandas version 2.2.3
Additional information
No response