Description
What happened + What you expected to happen
ray.data.from_huggingface doesn't load from a (seemingly properly) loaded Hugging Face dataset.
Expected: from_huggingface should work with any datasets.dataset_dict.DatasetDict object.
Exception stack
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/read_api.py", line 3231, in from_huggingface
return read_parquet(
^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/read_api.py", line 946, in read_parquet
datasource = ParquetDatasource(
^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 241, in __init__
pq_ds = get_parquet_dataset(paths, filesystem, dataset_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 629, in get_parquet_dataset
dataset = pq.ParquetDataset(
^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/parquet/core.py", line 1361, in __init__
self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/dataset.py", line 797, in dataset
return _filesystem_dataset(source, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.11/site-packages/pyarrow/dataset.py", line 486, in _filesystem_dataset
return factory.finish(schema)
^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 3198, in pyarrow._dataset.DatasetFactory.finish
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Error creating dataset. Could not read schema from 'https://huggingface.co/api/datasets/abisee/cnn_dailymail/parquet/3.0.0/train/0.parquet'. Is this a 'parquet' file?: Could not open Parquet input source 'https://huggingface.co/api/datasets/abisee/cnn_dailymail/parquet/3.0.0/train/0.parquet': Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
This happened while I was trying the Anyscale template "LLM offline batch inference with Ray Data LLM APIs".
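For anyone triaging: the final error says the Parquet magic bytes were not found in the footer. A plain (unencrypted) Parquet file starts and ends with the 4-byte marker PAR1, so the message means the bytes returned for that URL did not end with it. The sketch below only illustrates that check; the helper name looks_like_parquet and both payloads are made up for illustration, and it is an untested guess (not confirmed here) that the URL returned something other than Parquet data, such as a redirect or an error page.

```python
# Marker found at both ends of a plain (unencrypted) Parquet file.
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(payload: bytes) -> bool:
    """Return True if payload starts and ends with the Parquet magic bytes.

    Hypothetical helper for illustration; it is not part of Ray or PyArrow.
    """
    # Header magic (4) + footer length field (4) + footer magic (4).
    return (
        len(payload) >= 12
        and payload[:4] == PARQUET_MAGIC
        and payload[-4:] == PARQUET_MAGIC
    )


# Made-up payloads, not real responses from the Hugging Face Hub.
fake_parquet = PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC
html_error_page = b"<!DOCTYPE html><html><body>Not Found</body></html>"

print(looks_like_parquet(fake_parquet))     # True
print(looks_like_parquet(html_error_page))  # False
```

Running the same check on the first and last few bytes actually returned for the failing URL would show whether the file itself is malformed or whether a non-Parquet response is being handed to PyArrow.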
Versions / Dependencies
datasets==3.6.0
ray==2.47.1
Reproduction script
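import ray
import datasets
# No split is given, so this returns a datasets.dataset_dict.DatasetDict
df = datasets.load_dataset("cnn_dailymail", "3.0.0")
print(type(df))
# Passing the "train" split raises the ArrowInvalid error shown above
ds = ray.data.from_huggingface(df["train"])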
Issue Severity
High: It blocks me from completing my task. Specifically, I cannot finish the template notebook mentioned above.