[data] Slow fetching of metadata for large number of parquet files · Issue #53995 · ray-project/ray · GitHub
[data] Slow fetching of metadata for large number of parquet files #53995
Open
@wingkitlee0

Description

What happened + What you expected to happen

I have 12K Parquet files under an S3 prefix. When I call read_parquet(), it takes about 4 minutes to finish fetching metadata before anything else happens. It looks like _fetch_metadata_parallel uses ray.remote tasks (processes).

Would threading be more appropriate for S3 in this case? On a single node, using concurrent.futures.ThreadPoolExecutor takes about 1 minute to fetch metadata from all fragments. It probably requires some tuning.
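A minimal sketch of the threaded approach described above, not the code used for the 1-minute measurement. The names fetch_metadata_threaded and read_parquet_footer and the max_workers default are hypothetical, and the per-file fetcher assumes a pyarrow version whose pyarrow.parquet.read_metadata accepts a filesystem argument.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def fetch_metadata_threaded(
    paths: Iterable[str],
    fetch_one: Callable[[str], T],
    max_workers: int = 64,  # assumed starting point; tune for your network and store
) -> List[T]:
    """Call fetch_one(path) for every path on a thread pool.

    Results come back in the same order as the input paths.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, paths))


def read_parquet_footer(path: str, filesystem=None):
    """Hypothetical per-file fetcher: read only the Parquet footer metadata."""
    # Imported lazily so the helper above has no hard dependency on pyarrow.
    import pyarrow.parquet as pq

    return pq.read_metadata(path, filesystem=filesystem)
```

Footer reads against S3 are I/O-bound, so threads can overlap the request latency without the per-task overhead of separate processes. The right worker count depends on the environment, which matches the tuning caveat above.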

I can try implementing a ParquetMetadataProvider to test this.

Related issue for read_images: #40686

Versions / Dependencies

ray 2.47.1

Reproduction script

Just call ray.data.read_parquet on a prefix with a large number of files.
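A rough timing sketch for the reproduction, assuming Ray is installed and that the path points at a prefix holding many Parquet files. The helper name time_read_parquet and the example path in the comment are hypothetical.

```python
import time


def time_read_parquet(path: str) -> float:
    """Return the seconds spent inside ray.data.read_parquet(path).

    In the behavior reported here, this call does not return until
    metadata fetching has finished, so the elapsed time includes it.
    """
    # Imported here so this file can be loaded without Ray installed.
    import ray.data

    start = time.perf_counter()
    ray.data.read_parquet(path)
    return time.perf_counter() - start


# Example call (hypothetical prefix, replace with your own):
# print(time_read_parquet("s3://my-bucket/my-prefix/"))
```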

Issue Severity

Medium: It is a significant difficulty but I can work around it.

Metadata

Assignees

No one assigned

Labels

bug, data, performance, triage
