[Data] `ArrowInvalid` during `ray.data.from_huggingface`: Parquet magic bytes not found in footer · Issue #54101 · ray-project/ray · GitHub
[Data] ArrowInvalid during ray.data.from_huggingface: Parquet magic bytes not found in footer #54101
Open
@lk-chen

Description

What happened + What you expected to happen

ray.data.from_huggingface fails to load a Hugging Face dataset that datasets.load_dataset appears to have loaded correctly.

Expected: from_huggingface should work with any datasets.dataset_dict.DatasetDict object.

Exception stack
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/read_api.py", line 3231, in from_huggingface
    return read_parquet(
           ^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/read_api.py", line 946, in read_parquet
    datasource = ParquetDatasource(
                 ^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 241, in __init__
    pq_ds = get_parquet_dataset(paths, filesystem, dataset_kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 629, in get_parquet_dataset
    dataset = pq.ParquetDataset(
              ^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1361, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/dataset.py", line 797, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
    return factory.finish(schema)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3198, in pyarrow._dataset.DatasetFactory.finish
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'https://huggingface.co/api/datasets/abisee/cnn_dailymail/parquet/3.0.0/train/0.parquet'. Is this a 'parquet' file?: Could not open Parquet input source 'https://huggingface.co/api/datasets/abisee/cnn_dailymail/parquet/3.0.0/train/0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

This happened while I was trying the Anyscale template "LLM offline batch inference with Ray Data LLM APIs".
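
For anyone triaging: the error above says the footer of the file at that URL lacks the Parquet magic bytes. An unencrypted Parquet file begins and ends with the 4-byte marker `PAR1`, so one way to narrow this down is to download the file and check both ends. The sketch below does that for a local copy, using only the standard library. The helper name `has_parquet_magic` and the local path are hypothetical, and this only tells you whether the bytes look like Parquet; it does not confirm what the URL actually returned in the failing run.

```python
import os

# 4-byte marker at both the start and the end of an unencrypted Parquet file.
# (Files with an encrypted footer use a different marker and are not handled here.)
PARQUET_MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file at `path` starts and ends with PARQUET_MAGIC."""
    size = os.path.getsize(path)
    # Smallest plausible layout: header magic + 4-byte footer length + footer magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, os.SEEK_END)  # jump to the last 4 bytes of the file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

If a downloaded copy (for example a hypothetical `0.parquet` saved locally) fails this check, the response body was probably something other than the Parquet file itself, such as an error page or a redirect payload.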

Versions / Dependencies

datasets==3.6.0
ray==2.47.1

Reproduction script

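import ray
import datasets

# load_dataset returns a DatasetDict keyed by split name.
df = datasets.load_dataset("cnn_dailymail", "3.0.0")
print(type(df))
# This call raises the ArrowInvalid error shown in the stack trace above.
ds = ray.data.from_huggingface(df["train"])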

Issue Severity

High: It blocks me from completing my task. Specifically, I cannot finish the template notebook mentioned above.

Labels

bug: Something that is supposed to be working; but isn't
data: Ray Data-related issues
triage: Needs triage (eg: priority, bug/not-bug, and owning component)
