What happened + What you expected to happen
I have 12K parquet files in an S3 prefix. When using `read_parquet()`, it takes about 4 minutes to finish fetching metadata before doing anything. It looks like `_fetch_metadata_parallel` is using `ray.remote` (processes). Threading seems more appropriate for S3 in this case: on a single node, `concurrent.futures.ThreadPoolExecutor` fetches metadata from all fragments in about 1 minute. The thread count probably requires some tuning.
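For reference, a minimal sketch of the single-node timing experiment, assuming a placeholder bucket/prefix and an arbitrary worker count:

```python
# Sketch of the single-node comparison: fetch parquet footers with threads.
# "s3://my-bucket/my-prefix/" and max_workers=64 are placeholder values.
import time
from concurrent.futures import ThreadPoolExecutor

import pyarrow.dataset as ds

dataset = ds.dataset("s3://my-bucket/my-prefix/", format="parquet")
fragments = list(dataset.get_fragments())

def fetch_metadata(fragment):
    # Reading fragment.metadata issues a small S3 GET for the file footer,
    # so the work is I/O-bound and parallelizes well across threads.
    return fragment.metadata

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=64) as pool:
    metadata = list(pool.map(fetch_metadata, fragments))
print(f"Fetched {len(metadata)} footers in {time.perf_counter() - start:.1f}s")
```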
I can try to implement a `ParquetMetadataProvider` to test; a rough sketch is below.
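Something like the following, assuming the `ParquetMetadataProvider.prefetch_file_metadata` hook still exists in this form in 2.47.1 (the class name, method name, and signature here are my assumptions based on older Ray code; the worker count is a placeholder):

```python
# Hypothetical provider that swaps the ray.remote fan-out for a thread pool.
# prefetch_file_metadata is assumed to be the hook read_parquet uses for
# footer fetching; the exact API may differ in ray 2.47.1.
from concurrent.futures import ThreadPoolExecutor
from typing import Any, List, Optional

from ray.data.datasource import ParquetMetadataProvider


class ThreadedParquetMetadataProvider(ParquetMetadataProvider):
    def prefetch_file_metadata(
        self, fragments: List[Any], **ray_remote_args
    ) -> Optional[List[Any]]:
        # Fetch every fragment's footer concurrently; S3 footer reads are
        # I/O-bound, so threads avoid the per-task overhead of ray.remote.
        with ThreadPoolExecutor(max_workers=64) as pool:
            return list(pool.map(lambda f: f.metadata, fragments))
```

If `read_parquet` still accepts a `meta_provider` argument in this version, this could be passed as `meta_provider=ThreadedParquetMetadataProvider()`; otherwise it would need to be patched into the datasource.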
Related issue for `read_images`: #40686
Versions / Dependencies
ray 2.47.1
Reproduction script
Just `ray.data.read_parquet` with a large number of files.
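For example, with a placeholder path standing in for a prefix holding ~12K parquet files:

```python
# Minimal repro sketch; the S3 path is a placeholder. Metadata fetching
# happens eagerly here, before any data is actually read.
import ray

ds = ray.data.read_parquet("s3://my-bucket/my-prefix/")
```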
Issue Severity
Medium: It is a significant difficulty but I can work around it.