8000 Cannot extract MySQL datetime value 0000-00-00 00:00:00 with pyarrow or connectorx · Issue #2699 · dlt-hub/dlt · GitHub


Open
tramjoe opened this issue May 30, 2025 · 1 comment
Labels: question (Further information is requested)

tramjoe commented May 30, 2025

dlt version

1.11.0

Describe the problem

We have MySQL 5.7.24 tables containing some datetime values set to '0000-00-00 00:00:00'. This is MySQL-specific behavior that occurs when columns are not nullable and strict checking is not enabled. I assume that with naive type coercion expecting '1970-01-01 00:00:00' as the sensible min() value, this results in a negative / invalid timestamp.
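To illustrate why the value breaks extraction (my own minimal sketch, not from dlt): MySQL's zero date has a year, month, and day of 0, none of which Python's datetime (or an Arrow timestamp) can represent, so any naive parse fails outright.

```python
from datetime import datetime

# "0000-00-00 00:00:00" has year 0 and month/day 0, all outside what
# datetime can represent, so parsing raises ValueError.
try:
    datetime.strptime("0000-00-00 00:00:00", "%Y-%m-%d %H:%M:%S")
    is_representable = True
except ValueError:
    is_representable = False
```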

However, a quick internet search indicates this is a known issue, already accounted for by many tools.

With backend set to either 'sqlalchemy' or 'pandas', dlt does just fine, and I get the correct '1970-01-01 00:00:00' in my destination (clickhouse, not that it matters for this issue; the problem appears during extraction).

Using 'pyarrow' as a backend, I am getting:

Traceback (most recent call last):
  File "/root/test_dlt/cgrates_cdrs_pipeline.py", line 124, in <module>
    load_standalone_table_resource()
  File "/root/test_dlt/cgrates_cdrs_pipeline.py", line 57, in load_standalone_table_resource
    info = pipeline.extract([cgrates_cdrs], write_disposition="append", workers=8)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 176, in _wrap
    rv = f(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 162, in _wrap
    return f(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 271, in _wrap
    return f(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 489, in extract
    raise PipelineStepFailed(
dlt.pipeline.exceptions.PipelineStepFailed: Pipeline execution failed at stage extract when processing package 1748586785.3124416 with exception:

<class 'dlt.extract.exceptions.ResourceExtractionError'>
In processing pipe cdrs_20250517: extraction of resource cdrs_20250517 in generator table_rows caused an exception: Conversion to arrow failed for field `answer_time` with dlt hint `data_type=timestamp` and `inferred_arrow_type=timestamp[us]` This data type seems currently unsupported by dlt. Please open a GitHub issue

Using connectorx as a backend, the error is:

thread '<unnamed>' panicked at /github/home/.cargo/registry/src/index.crates.io-6f17d22bba15001f/mysql_common-0.29.2/src/value/convert/mod.rs:175:23:
Could not retrieve chrono::naive::datetime::NaiveDateTime from Value
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/root/test_dlt/cgrates_cdrs_pipeline.py", line 124, in <module>
    load_standalone_table_resource()
  File "/root/test_dlt/cgrates_cdrs_pipeline.py", line 57, in load_standalone_table_resource
    info = pipeline.extract([cgrates_cdrs], write_disposition="append", workers=8)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 222, in _wrap
    step_info = f(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 176, in _wrap
    rv = f(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 162, in _wrap
    return f(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 271, in _wrap
    return f(self, *args, **kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 468, in extract
    self._extract_source(
  File "/usr/lib/python3.9/site-packages/dlt/pipeline/pipeline.py", line 1254, in _extract_source
    load_id = extract.extract(
  File "/usr/lib/python3.9/site-packages/dlt/extract/extract.py", line 457, in extract
    self._extract_single_source(
  File "/usr/lib/python3.9/site-packages/dlt/extract/extract.py", line 380, in _extract_single_source
    for pipe_item in pipes:
  File "/usr/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 162, in __next__
    pipe_item = self._get_source_item()
  File "/usr/lib/python3.9/site-packages/dlt/extract/pipe_iterator.py", line 277, in _get_source_item
    pipe_item = next(gen)
  File "/usr/lib/python3.9/site-packages/dlt/sources/sql_database/helpers.py", line 303, in table_rows
    yield from loader.load_rows(backend_kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/sources/sql_database/helpers.py", line 176, in load_rows
    yield from self._load_rows_connectorx(query, backend_kwargs)
  File "/usr/lib/python3.9/site-packages/dlt/sources/sql_database/helpers.py", line 234, in _load_rows_connectorx
    df = cx.read_sql(conn, query_str, **backend_kwargs)
  File "/usr/lib/python3.9/site-packages/connectorx/__init__.py", line 386, in read_sql
    result = _read_sql(
pyo3_runtime.PanicException: Could not retrieve chrono::naive::datetime::NaiveDateTime from Value

Expected behavior

Coerce the 0000-00-00 00:00:00 to 1970-01-01 00:00:00 or let the destination do it.
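One possible coercion could look like the sketch below. The helper name and the row-map approach are hypothetical (something like it could be attached to a dlt resource via add_map), and it only helps backends that surface the raw value at all:

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def coerce_zero_datetime(value):
    # Hypothetical helper: map MySQL's zero datetime (seen as a raw string)
    # to the Unix epoch; pass every other value through untouched.
    if value == "0000-00-00 00:00:00":
        return EPOCH
    return value

row = {"answer_time": "0000-00-00 00:00:00"}
row["answer_time"] = coerce_zero_datetime(row["answer_time"])
```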

With pandas, it seems the latter is the default behavior:

2025-05-30 06:50:44,824|[WARNING]|1749067|139738012964672|dlt|extractors.py|_compute_tables:487|In resource: cdrs_20250517, when merging arrow schema with dlt schema, several column hints were different. dlt schema hints were kept and arrow schema and data were unmodified. It is up to destination to coerce the differences when loading. Change log level to INFO for more details.

With sqlalchemy, it just works as expected.

Steps to reproduce

  • Have a source MySQL table with a datetime NOT NULL column
  • Use a MySQL version old enough to allow zero dates, or one with the NO_ZERO_DATE mode unset (deprecated in 8.4, I believe)
  • Do not have strict mode enabled, if the above applies
  • Try to extract that table using the backends mentioned above (pyarrow, connectorx)
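The setup steps above could be reproduced with SQL roughly like this (a sketch; the table and column names are made up for illustration, and the session must allow zero dates):

```python
# Hypothetical SQL to seed a zero-date row on a permissive MySQL
# (e.g. 5.7 with strict mode and NO_ZERO_DATE disabled).
SETUP_SQL = """
SET SESSION sql_mode = '';
CREATE TABLE cdrs_repro (answer_time DATETIME NOT NULL);
INSERT INTO cdrs_repro (answer_time) VALUES ('0000-00-00 00:00:00');
"""
```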

Operating system

Linux

Runtime environment

Virtual Machine

Python version

3.9

dlt data source

sql_database using mysql+pymysql driver

dlt destination

clickhouse 25.2.2.39

Other deployment details

No response

Additional information

I am quite new to dlt; I started evaluating it just yesterday, hoping to replace my own custom Python code for our extract-and-load tasks from various sources to clickhouse, so bear with me for being a beginner.

Is there a workaround to allow those backends to work?
Is there any more testing I can do to help?

@tramjoe tramjoe changed the title Cannot load MySQL datetime value 0000-00-00 00:00:00 with pyarrow or connectorx Cannot extract MySQL datetime value 0000-00-00 00:00:00 with pyarrow or connectorx May 30, 2025
rudolfix commented Jun 2, 2025

@tramjoe looks like a connector-x bug. we are investigating several related bugs with wrong date-time handling. there's a new connector-x version where the arrow2 backend was removed. maybe you can try it with the arrow backend? check here for more info: #2661 (comment)

tl;dr: this is a connector-x bug, so if it is not fixed in the newest version you need to file a bug report there.
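For reference, dlt forwards backend_kwargs to connectorx.read_sql, whose public API accepts a return_type parameter, so the arrow backend can be requested like this (a sketch; whether it avoids the panic depends on the installed connector-x version):

```python
# Request connectorx's "arrow" return type instead of the removed "arrow2"
# backend; this dict would be passed as backend_kwargs to the sql_database
# source, which hands it to connectorx.read_sql.
backend_kwargs = {"return_type": "arrow"}
```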

@rudolfix rudolfix moved this from Todo to In Progress in dlt core library Jun 2, 2025
@rudolfix rudolfix added the question Further information is requested label Jun 2, 2025
@rudolfix rudolfix self-assigned this Jun 2, 2025