Description
I benchmarked `read_images` on 100M images. It takes several hours before any data is actually read. Here's how the time is spent:
- Checking if paths have the `local://` scheme: 10 min
  - Potential fix: replace the `local://` scheme with an explicit `distributed` parameter
  - Relevant code: `ray/python/ray/data/read_api.py`, line 321 at `7787b2e`
- Resolving paths and filesystem: 24 min
  - Potential fix 1: parallelize this operation
  - Potential fix 2: manually resolve paths (the PyArrow implementation calls `get_file_info` once per path). Doing this reduced the time from 24 min to 10 min.
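The per-path vs. batched pattern can be sketched with the standard library alone (PyArrow's `FileSystem.get_file_info` also accepts a list of paths or a `FileSelector`, which allows one batched call instead of N single-path calls). The helper names below are illustrative, not Ray's:

```python
import os
import tempfile

# Per-path resolution (sketch of the slow code path): one filesystem
# call per file, so 100M paths means 100M round trips.
def resolve_per_path(paths):
    return {p: os.stat(p).st_size for p in paths}

# Manual batched resolution (sketch of the fix): walk the directory
# once and collect metadata in a single pass, analogous to passing a
# FileSelector or a path list to get_file_info in one call.
def resolve_batched(directory):
    return {
        entry.path: entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    }

with tempfile.TemporaryDirectory() as root:
    paths = []
    for i in range(3):
        p = os.path.join(root, f"img_{i}.bin")
        with open(p, "wb") as f:
            f.write(b"\0" * i)
        paths.append(p)
    # Both strategies agree on the metadata; only the call count differs.
    assert resolve_per_path(paths) == resolve_batched(root)
```

On an object store the gap is much larger than locally, since each single-path call is a network round trip.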
- Deciding which meta provider optimization to use: ~20 min
  - Potential fix: remove the common-path optimization; it might not actually be faster.
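Part of why the optimization may not pay off: just deciding whether paths share a common prefix is itself a full pass over every character of every path. A sketch of that cost using `os.path.commonpath` (which may differ from Ray's actual implementation):

```python
import os.path

# A hypothetical path list; at 100M entries this prefix computation
# alone is O(total characters), which can rival the work it saves.
paths = [f"/data/images/shard_{i:03d}/img.png" for i in range(1000)]

common = os.path.commonpath(paths)
```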
- Launching 600K tasks: 20+ min
  - Potential fix: change the way we determine how many tasks to launch
  - Relevant code: `ray/python/ray/data/datasource/file_meta_provider.py`, lines 541 to 544 at `7787b2e`
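One hedged alternative for sizing the task pool: derive the task count from a fixed batch size and cap it. The function name, batch size, and cap below are illustrative assumptions, not Ray's actual heuristic:

```python
import math

def num_metadata_tasks(num_paths, paths_per_task=2048, max_tasks=1000):
    # One task per fixed-size batch of paths, capped so that a
    # 100M-path job yields ~1K tasks instead of 600K.
    return min(max(1, math.ceil(num_paths / paths_per_task)), max_tasks)
```

At 600K tasks, per-task scheduling overhead dominates; capping the count keeps each task large enough to amortize it.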
- Fetching the actual metadata: many days, or much longer
  - Potential fix: launch fewer tasks. Metadata fetching is unusually slow, and I suspect it's because we launch far too many of them.
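Launching fewer tasks could look like grouping paths into large batches so each task does enough work to amortize its overhead. This sketch uses plain functions rather than Ray remote tasks, and the batch size and `fetch_metadata` stand-in are assumptions:

```python
def batch_paths(paths, batch_size=4096):
    # Group paths so each metadata-fetch task amortizes its scheduling
    # overhead over thousands of files instead of a handful.
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

def fetch_metadata(batch):
    # Stand-in for the real per-batch fetch (e.g. one batched
    # get_file_info call covering the whole batch).
    return [(p, len(p)) for p in batch]

paths = [f"img_{i}.png" for i in range(10_000)]
batches = batch_paths(paths)
meta = [m for b in batches for m in fetch_metadata(b)]
```

Here 10K paths become 3 tasks rather than thousands; scaled up, 100M paths would yield ~25K batches instead of 600K tasks.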