What happened + What you expected to happen
I have 12K parquet files in an S3 prefix. When using `read_parquet()`, it takes about 4 minutes to finish fetching metadata before doing anything. It looks like `_fetch_metadata_parallel` is using `ray.remote` (processes). Threading seems more appropriate for S3 in this case: on a single node, `concurrent.futures.ThreadPoolExecutor` fetches metadata from all fragments in about 1 minute. The thread count probably requires some tuning.
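For reference, a minimal sketch of the single-node timing experiment, assuming a placeholder bucket/prefix and an arbitrary worker count:

```python
# Sketch of the single-node comparison: fetch parquet footers with threads.
# "s3://my-bucket/my-prefix/" and max_workers=64 are placeholder values.
import time
from concurrent.futures import ThreadPoolExecutor

import pyarrow.dataset as ds

dataset = ds.dataset("s3://my-bucket/my-prefix/", format="parquet")
fragments = list(dataset.get_fragments())

def fetch_metadata(fragment):
    # Reading fragment.metadata issues a small S3 GET for the file footer,
    # so the work is I/O-bound and parallelizes well across threads.
    return fragment.metadata

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    metadata = list(pool.map(fetch_metadata, fragments))
print(f"Fetched {len(metadata)} footers in {time.perf_counter() - start:.1f}s")
```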
I can try to implement a `ParquetMetadataProvider` to test; a rough sketch is below.
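Something like the following, assuming the `ParquetMetadataProvider.prefetch_file_metadata` hook still exists in this form in 2.47.1 (the class name, method name, and signature here are my assumptions based on older Ray code; the worker count is a placeholder):

```python
# Hypothetical provider that swaps the ray.remote fan-out for a thread pool.
# prefetch_file_metadata is assumed to be the hook read_parquet uses for
# footer fetching; the exact API may differ in ray 2.47.1.
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List, Optional

from ray.data.datasource import ParquetMetadataProvider


class ThreadedParquetMetadataProvider(ParquetMetadataProvider):
    def prefetch_file_metadata(
        self, fragments: List[Any], **ray_remote_args
    ) -> Optional[List[Any]]:
        # Fetch every fragment's footer concurrently; S3 footer reads are
        # I/O-bound, so threads avoid the per-task overhead of ray.remote.
        with ThreadPoolExecutor(max_workers=64) as pool:
            return list(pool.map(lambda f: f.metadata, fragments))
```

If `read_parquet` still accepts a `meta_provider` argument in this version, this could be passed as `meta_provider=ThreadedParquetMetadataProvider()`; otherwise it would need to be patched into the datasource.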
Related issue for `read_images`: #40686
Versions / Dependencies
ray 2.47.1
Reproduction script
Just `ray.data.read_parquet` with a large number of files.
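For example, with a placeholder path standing in for a prefix holding ~12K parquet files:

```python
# Minimal repro sketch; the S3 path is a placeholder. Metadata fetching
# happens eagerly here, before any data is actually read.
import ray

ds = ray.data.read_parquet("s3://my-bucket/my-prefix/")
```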
Issue Severity
Medium: It is a significant difficulty but I can work around it.