[Data] Optimize `read_images` on millions of paths · Issue #40686 · ray-project/ray
Open

@bveeramani

Description

I benchmarked `read_images` on 100M images. It takes several hours before any data is actually read. Here's how the time is spent:

Checking if paths have the `local://` scheme: 10 min
Potential fix: Replace the `local://` scheme with an explicit `distributed` parameter (see the sketch below).

```python
if paths and _is_local_scheme(paths):
```
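A minimal sketch of what the proposal could look like, assuming a hypothetical `distributed` keyword that is not part of Ray's current `read_images` signature: the caller states the intent once, so no per-path scheme parsing is needed.

```python
import ray

# Hypothetical API sketch: `distributed` is an assumed parameter name from
# the proposal above, not an existing ray.data.read_images() argument.
# An explicit flag replaces the O(len(paths)) local:// scheme scan.
paths = [f"/data/images/{i:06d}.png" for i in range(1_000)]
ds = ray.data.read_images(
    paths,              # plain paths; no "local://" prefix required
    distributed=False,  # read only on the local node
)
```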

Resolving paths and the filesystem: 24 min
Potential fix 1: Parallelize this operation.
Potential fix 2: Resolve paths manually (the PyArrow implementation calls `get_file_info` for each path). Doing this reduced the time from 24 min to 10 min. A sketch follows the excerpt.

```python
paths, self._filesystem = _resolve_paths_and_filesystem(paths, filesystem)
```
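A rough sketch of manual resolution, assuming every path shares the scheme of the first one; the helper name is illustrative, not Ray's actual implementation. Resolving the filesystem once and stripping the scheme prefix with string slicing avoids PyArrow's per-path `get_file_info` calls.

```python
import pyarrow.fs as pafs

def resolve_paths_and_filesystem_manually(paths: list[str]):
    # Resolve the filesystem from the first URI only.
    filesystem, first_resolved = pafs.FileSystem.from_uri(paths[0])
    # Every other path is assumed to share the same scheme, so the prefix
    # (e.g. "s3://") can be stripped with cheap string slicing instead of
    # another round of filesystem resolution.
    prefix_len = len(paths[0]) - len(first_resolved)
    return [p[prefix_len:] for p in paths], filesystem
```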

Deciding which meta provider optimization to use: ~20 min
Potential fix: Remove the common-path optimization. It might not actually be faster.

```python
common_path = os.path.commonpath(paths)
# ... (part of a larger condition)
) or all(str(pathlib.Path(path).parent) == common_path for path in paths):
```
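To make the cost concrete, here is a small timing sketch with synthetic paths (numbers will vary): both `os.path.commonpath` and the per-path `pathlib.Path(...).parent` comparison walk the entire list, and constructing 100M `pathlib.Path` objects is expensive on its own.

```python
import os
import pathlib
import time

# Synthetic paths; the benchmark in this issue had 100M of them.
paths = [f"bucket/images/{i:09d}.jpg" for i in range(1_000_000)]

start = time.perf_counter()
common_path = os.path.commonpath(paths)
same_parent = all(str(pathlib.Path(p).parent) == common_path for p in paths)
elapsed = time.perf_counter() - start

# Seconds at 1M paths; extrapolating to 100M gives minutes before any I/O.
print(f"common_path={common_path!r}, same_parent={same_parent}, {elapsed:.1f}s")
```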

Launching 600K tasks: 20+ min
Potential fix: Change the way we determine how many tasks to launch (one possibility is sketched below).

```python
for uri_chunk in np.array_split(uris, parallelism):
    if len(uri_chunk) == 0:
        continue
    fetch_tasks.append(remote_fetch_func.remote(uri_chunk))
```
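One possibility, sketched with an assumed minimum chunk size (the constant and helper name are not from Ray): derive the task count from the number of paths rather than always splitting into `parallelism` pieces.

```python
import numpy as np

MIN_PATHS_PER_TASK = 1_000  # assumed tuning knob, not a Ray constant

def chunk_uris(uris: list[str], parallelism: int) -> list:
    # Cap the number of chunks so no task receives a tiny slice of paths.
    # At 100M paths this yields a bounded task count instead of 600K tasks.
    num_tasks = min(parallelism, max(1, len(uris) // MIN_PATHS_PER_TASK))
    return [c for c in np.array_split(uris, num_tasks) if len(c) > 0]
```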

Fetching the actual metadata: many days, or much longer
Potential fix: Launch fewer tasks. Metadata fetching is unusually slow, and I think it's because we launch far too many tasks (see the sketch below).

```python
results = metadata_fetch_bar.fetch_until_complete(fetch_tasks)
```
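A sketch of what fewer, larger tasks could look like; `fetch_metadata` is a placeholder for the real task, which would call the filesystem's `get_file_info` on each URI, and the batching logic is an assumption rather than Ray's code.

```python
import ray

@ray.remote
def fetch_metadata(uri_batch: list[str]) -> list[dict]:
    # Placeholder body: the real task would call filesystem.get_file_info()
    # (or equivalent) for every URI in the batch.
    return [{"uri": uri} for uri in uri_batch]

def fetch_all_metadata(uris: list[str], num_tasks: int = 1_000) -> list[dict]:
    # A bounded task count keeps Ray's per-task scheduling overhead small
    # relative to the metadata work each task actually performs.
    batch_size = max(1, len(uris) // num_tasks)
    refs = [
        fetch_metadata.remote(uris[i:i + batch_size])
        for i in range(0, len(uris), batch_size)
    ]
    results = []
    while refs:
        done, refs = ray.wait(refs, num_returns=1)
        results.extend(ray.get(done[0]))
    return results
```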

Metadata

Assignees: No one assigned

Labels: P1 (Issue that should be fixed within a few weeks) · data (Ray Data-related issues) · enhancement (Request for new feature and/or capability)
