Heterogeneous columns in Pandas data frames fail to extract #2536

carlpaten · 2025-04-19T19:19:37Z

dlt version

1.9.0

Describe the problem

dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1745090202.306819 with exception:

<class 'pyarrow.lib.ArrowTypeError'>
("Expected bytes, got a 'int' object", 'Conversion failed for column col with type object')

This comes up for example when loading Excel spreadsheets that have mixed number/string columns.

Expected behavior

Integer values should be coerced to strings as part of normalization.

Steps to reproduce

import dlt
import pandas as pd

df = pd.DataFrame({"col": ["str", 1]})
pipeline = dlt.pipeline(destination="bigquery")
pipeline.extract(df, table_name="test")

Operating system

Linux, macOS

Runtime environment

Local

Python version

3.12

dlt data source

N/A

dlt destination

No response

Other deployment details

Pandas version 2.2.3

Additional information

No response


carlpaten · 2025-04-19T20:09:08Z

Hacky fix: run the data frame through

import io

import pandas as pd

def pandas_normalize_types(df: pd.DataFrame) -> pd.DataFrame:
    """Given a Pandas dataframe with heterogeneous columns, like those produced by pd.read_excel(), coerce each column's dtype and values to a single type"""
    # Hack: piggyback on the type inference of pandas read_csv by round-tripping through CSV
    tmp = io.StringIO()
    df.to_csv(tmp, index=False)
    tmp.seek(0)
    # read_csv re-infers one dtype per column; convert_dtypes() then picks nullable dtypes
    return pd.read_csv(tmp).convert_dtypes()
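
A narrower alternative to the CSV round trip, as a minimal sketch: stringify only the object columns that actually hold more than one Python type, and leave everything else untouched. The function name `stringify_mixed_columns` is hypothetical and not part of dlt or pandas; the sketch assumes cell values are scalars, as they are when read from a spreadsheet.

```python
import pandas as pd


def stringify_mixed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df where mixed-type object columns hold only strings.

    Hypothetical helper for illustration; assumes scalar cell values.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if col.dtype != object:
            # Columns with a concrete dtype are already homogeneous.
            continue
        # Collect the Python types of the non-missing values.
        kinds = {type(value) for value in col.dropna()}
        if len(kinds) > 1:
            # Keep missing values as they are, convert everything else to str.
            out[name] = col.where(col.isna(), col.astype(str))
    return out


df = pd.DataFrame({"col": ["str", 1], "n": [1, 2]})
clean = stringify_mixed_columns(df)
```

Unlike the CSV approach, this does not re-infer the types of columns that were already fine, so numeric and datetime columns pass through unchanged.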

sh-rp · 2025-04-22T06:18:14Z

Hey @carlpaten,

thanks for the ticket. If you want the normalizer to run, you will have to yield dictionaries. If you yield tabular data (dataframes or arrow tables), we expect the columns to be homogeneous, and dlt goes into a kind of optimized mode where the normalizer is mostly skipped, except for possibly adding internal columns and normalizing column names. Your example creates a dataframe with an "object" column type, which allows mixed-type columns, and that will not work with dlt. We should probably improve the error message, but I strongly doubt that we will add a full normalization step for dataframes, as it defeats the purpose. Is there any specific reason you can't just yield dicts?
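
For reference, yielding dicts from the data frame in the reproduction could look like the sketch below. The generator name `iter_records` is hypothetical. The commented-out call at the end mirrors the `pipeline.extract(...)` call from the report and is left as a comment so the snippet runs without dlt installed; how the normalizer then treats the mixed values is up to dlt.

```python
from typing import Any, Dict, Iterator

import pandas as pd


def iter_records(df: pd.DataFrame) -> Iterator[Dict[str, Any]]:
    """Yield one plain dict per row (hypothetical helper for illustration)."""
    for record in df.to_dict(orient="records"):
        yield record


df = pd.DataFrame({"col": ["str", 1]})
rows = list(iter_records(df))

# Assumed usage, following the reproduction steps in the report:
# pipeline.extract(iter_records(df), table_name="test")
```

The trade-off is the one described above: rows go through the normalizer instead of the optimized tabular path, so large frames will extract more slowly.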

sh-rp · 2025-05-02T08:14:34Z

closing for no activity

github-project-automation bot added this to dlt core library Apr 19, 2025

github-project-automation bot moved this to Todo in dlt core library Apr 19, 2025

carlpaten changed the title ~~Heterogeneous columns in Pandas data frames cause crashes~~ Heterogeneous columns in Pandas data frames fail to extract Apr 19, 2025

sh-rp added the wontfix This will not be worked on label Apr 23, 2025

rudolfix added the question Further information is requested label Apr 23, 2025

rudolfix assigned sh-rp Apr 23, 2025

sh-rp closed this as completed May 2, 2025

github-project-automation bot moved this from Todo to Done in dlt core library May 2, 2025
