Heterogeneous columns in Pandas data frames fail to extract · Issue #2536 · dlt-hub/dlt · GitHub
Heterogeneous columns in Pandas data frames fail to extract #2536

Closed
carlpaten opened this issue Apr 19, 2025 · 3 comments
Labels: question (Further information is requested), wontfix (This will not be worked on)

Comments

carlpaten commented Apr 19, 2025

dlt version

1.9.0

Describe the problem

dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1745090202.306819 with exception:

<class 'pyarrow.lib.ArrowTypeError'>
("Expected bytes, got a 'int' object", 'Conversion failed for column col with type object')

This comes up for example when loading Excel spreadsheets that have mixed number/string columns.

Expected behavior

Integer values should be coerced to string as part of normalization

Steps to reproduce

import dlt
import pandas as pd

df = pd.DataFrame({"col": ["str", 1]})
pipeline = dlt.pipeline(destination="bigquery")
pipeline.extract(df, table_name="test")
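
To find which columns would trigger this error before handing a frame to dlt, one option is to scan the `object` columns for more than one Python value type. This is a minimal sketch using pandas only; the helper name `find_mixed_columns` is hypothetical and not part of dlt or pandas.

```python
import pandas as pd


def find_mixed_columns(df: pd.DataFrame) -> dict[str, list[str]]:
    """Map each object-dtype column holding several value types to those type names."""
    mixed: dict[str, list[str]] = {}
    for name in df.columns:
        col = df[name]
        # Only object columns can hold arbitrary Python values side by side.
        if col.dtype != object:
            continue
        # Collect the distinct Python types of the non-null values.
        kinds = sorted({type(value).__name__ for value in col.dropna()})
        if len(kinds) > 1:
            mixed[str(name)] = kinds
    return mixed


df = pd.DataFrame({"col": ["str", 1], "ok": ["a", "b"]})
print(find_mixed_columns(df))
```

For the frame from the reproduction steps, only `col` is reported, since it holds both a `str` and an `int`.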

Operating system

Linux, macOS

Runtime environment

Local

Python version

3.12

dlt data source

N/A

dlt destination

No response

Other deployment details

Pandas version 2.2.3

Additional information

No response

carlpaten changed the title from "Heterogeneous columns in Pandas data frames cause crashes" to "Heterogeneous columns in Pandas data frames fail to extract" Apr 19, 2025
carlpaten (Author) commented

Hacky fix: run the data frame through

import io

import pandas as pd


def pandas_normalize_types(df: pd.DataFrame) -> pd.DataFrame:
    """Given a Pandas dataframe with heterogeneous columns, like those produced by pd.read_excel(), coerce each column's dtype and values to a single type."""
    # Hack: piggyback on the type inference pandas applies when reading CSV
    tmp = io.StringIO()
    df.to_csv(tmp, index=False)
    tmp.seek(0)
    return pd.read_csv(tmp).convert_dtypes()
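
A CSV round trip re-infers the type of every column, so it can also change columns that were already homogeneous (for example, strings of digits may come back as integers). A narrower alternative, sketched here under the assumption that mixed columns should simply become strings, is to cast only the heterogeneous `object` columns. The helper name `stringify_mixed_columns` is hypothetical.

```python
import pandas as pd


def stringify_mixed_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where object columns with several value types are cast to string."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        # Leave typed columns and single-type object columns untouched.
        if col.dtype == object and col.dropna().map(type).nunique() > 1:
            # The pandas "string" dtype converts non-string values and keeps nulls as <NA>.
            out[name] = col.astype("string")
    return out


df = pd.DataFrame({"col": ["str", 1], "n": [1, 2]})
clean = stringify_mixed_columns(df)
print(clean.dtypes)
```

Whether string is the right target type depends on the data; numeric columns with a few stray text cells may be better handled by cleaning the cells instead.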

sh-rp (Collaborator) commented Apr 22, 2025

Hey @carlpaten,

thanks for the ticket. If you want the normalizer to run, you will have to yield dictionaries. If you yield tabular data (dataframes or Arrow tables), we expect the columns to be homogeneous, and dlt goes into a kind of optimized mode where the normalizer is mostly skipped, except for possibly adding internal columns and normalizing column names. In your example you are creating a dataframe with an "object" column, which allows mixed types within a single column, and that will not work with dlt. We should probably improve the error message, but I strongly doubt that we will add a full normalization step for dataframes, as it defeats the purpose. Is there any specific reason you can't just yield dicts?
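
The suggestion to yield dicts could look roughly like the following. This is only a sketch: `iter_rows` is a hypothetical helper, and how dlt's normalizer then treats the mixed values in `col` has not been verified here.

```python
from typing import Any, Iterator

import pandas as pd


def iter_rows(df: pd.DataFrame) -> Iterator[dict[str, Any]]:
    """Yield one plain dict per row instead of passing the dataframe itself."""
    # orient="records" produces a list of {column name: value} dicts.
    yield from df.to_dict(orient="records")


df = pd.DataFrame({"col": ["str", 1]})
for row in iter_rows(df):
    print(row)
```

With the pipeline from the reproduction steps, the generator would presumably be passed in place of the frame, as in `pipeline.extract(iter_rows(df), table_name="test")`. The trade-off is the one described above: rows go through the normalizer, so the optimized tabular path is not used.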

sh-rp added the wontfix label Apr 23, 2025
rudolfix added the question label Apr 23, 2025
sh-rp (Collaborator) commented May 2, 2025

Closing for lack of activity.

sh-rp closed this as completed May 2, 2025
github-project-automation bot moved this from Todo to Done in dlt core library May 2, 2025