Description
I benchmarked `read_images` on 100M images. It takes several hours before any data is actually read. Here's how the time is spent:
- Checking if paths have the `local://` scheme: 10 min
  - Potential fix: replace the `local://` scheme with an explicit `distributed` parameter
  - Relevant code: `ray/python/ray/data/read_api.py`, line 321 at `7787b2e`
- Resolving paths and filesystem: 24 min
  - Potential fix 1: parallelize this operation
  - Potential fix 2: manually resolve paths (the PyArrow implementation calls `get_file_info` once per path). Doing this reduced the time from 24 min to 10 min.
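The per-path vs. batched pattern can be sketched with the standard library alone (PyArrow's `FileSystem.get_file_info` also accepts a list of paths or a `FileSelector`, which allows one batched call instead of N single-path calls). The helper names below are illustrative, not Ray's:

```python
import os
import tempfile

# Per-path resolution (sketch of the slow code path): one filesystem
# call per file, so 100M paths means 100M round trips.
def resolve_per_path(paths):
    return {p: os.stat(p).st_size for p in paths}

# Manual batched resolution (sketch of the fix): walk the directory
# once and collect metadata in a single pass, analogous to passing a
# FileSelector or a path list to get_file_info in one call.
def resolve_batched(directory):
    return {
        entry.path: entry.stat().st_size
        for entry in os.scandir(directory)
        if entry.is_file()
    }

with tempfile.TemporaryDirectory() as root:
    paths = []
    for i in range(3):
        p = os.path.join(root, f"img_{i}.bin")
        with open(p, "wb") as f:
            f.write(b"\0" * i)
        paths.append(p)
    # Both strategies agree on the metadata; only the call count differs.
    assert resolve_per_path(paths) == resolve_batched(root)
```

On an object store the gap is much larger than locally, since each single-path call is a network round trip.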
- Deciding which meta provider optimization to use: ~20 min
  - Potential fix: remove the common-path optimization; it might not actually be faster.
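Part of why the optimization may not pay off: just deciding whether paths share a common prefix is itself a full pass over every character of every path. A sketch of that cost using `os.path.commonpath` (which may differ from Ray's actual implementation):

```python
import os.path

# A hypothetical path list; at 100M entries this prefix computation
# alone is O(total characters), which can rival the work it saves.
paths = [f"/data/images/shard_{i:03d}/img.png" for i in range(1000)]

common = os.path.commonpath(paths)
```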
- Launching 600K tasks: 20+ min
  - Potential fix: change the way we determine how many tasks to launch
  - Relevant code: `ray/python/ray/data/datasource/file_meta_provider.py`, lines 541 to 544 at `7787b2e`
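One hedged alternative for sizing the task pool: derive the task count from a fixed batch size and cap it. The function name, batch size, and cap below are illustrative assumptions, not Ray's actual heuristic:

```python
import math

def num_metadata_tasks(num_paths, paths_per_task=2048, max_tasks=1000):
    # One task per fixed-size batch of paths, capped so that a
    # 100M-path job yields ~1K tasks instead of 600K.
    return min(max(1, math.ceil(num_paths / paths_per_task)), max_tasks)
```

At 600K tasks, per-task scheduling overhead dominates; capping the count keeps each task large enough to amortize it.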
- Fetching the actual metadata: many days, or much longer
  - Potential fix: launch fewer tasks. Metadata fetching is unusually slow, and I suspect it's because we launch far too many of them.
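Launching fewer tasks could look like grouping paths into large batches so each task does enough work to amortize its overhead. This sketch uses plain functions rather than Ray remote tasks, and the batch size and `fetch_metadata` stand-in are assumptions:

```python
def batch_paths(paths, batch_size=4096):
    # Group paths so each metadata-fetch task amortizes its scheduling
    # overhead over thousands of files instead of a handful.
    return [paths[i:i + batch_size] for i in range(0, len(paths), batch_size)]

def fetch_metadata(batch):
    # Stand-in for the real per-batch fetch (e.g. one batched
    # get_file_info call covering the whole batch).
    return [(p, len(p)) for p in batch]

paths = [f"img_{i}.png" for i in range(10_000)]
batches = batch_paths(paths)
meta = [m for b in batches for m in fetch_metadata(b)]
```

Here 10K paths become 3 tasks rather than thousands; scaled up, 100M paths would yield ~25K batches instead of 600K tasks.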