kerchunk reference file and Zarr groups #556

Open · keltonhalbert opened this issue Apr 22, 2025 · 8 comments

@keltonhalbert (Contributor) commented Apr 22, 2025

Hello,

I was previously storing the output of scan_grib using Zarr groups so that an entire grib2 file could be stored in a single JSON reference file. This let me read from that single file with xarray, passing open_dataset_options={"group": "my_group"} to get the portion of the file on the common contiguous coordinate I wanted.

It appears that over the last year, through various updates to kerchunk, something changed or broke along the way, and I can no longer read reference JSONs containing a Zarr group. I don't know how useful it is, but I have attached an example reference file.

I attempted opening it with:

import xarray as xr
ds = xr.open_dataset("./hrrr-hybrid.json", engine="kerchunk", open_dataset_options={"group": "native"})

Previously, this would successfully open the group, but now it returns an empty dataset. Any suggestions about what I may be doing wrong? Is this functionality no longer supported, or is this a bug?

hrrr-hybrid.json

@keltonhalbert (Contributor, Author)

Narrowing this down a bit further: if I downgrade Zarr and Kerchunk to the following versions, the desired behavior returns:

kerchunk == 0.2.7
zarr == 2.18.7

@martindurant (Member)

If I downgrade Zarr and Kerchunk to the following versions

Do you mean that the existing reference file opens, or only if you remake it with the older version of kerchunk?

@keltonhalbert (Contributor, Author)

@martindurant apologies for the lack of clarity -- I was intending to say that the existing reference file opens.

If it helps, I essentially tack on the group information to the reference keys after they are returned from scan_grib... and as far as I can tell with the recent changes to kerchunk, it should still be using the Zarr v2 spec? I'm not well versed in what's changed with the Zarr spec for v3, but my intuition is that v2 should still work.
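Roughly, the tacking-on step looks something like this (a minimal sketch with placeholder file and group names, assuming scan_grib returns its usual list of {"version": ..., "refs": ...} dicts):

import json
from kerchunk.grib2 import scan_grib

# Sketch: prefix each message's reference keys with a group name so one JSON
# reference file holds several Zarr groups. Input path and group naming are
# placeholders; in practice the group is chosen by the common coordinate.
messages = scan_grib("hrrr.t00z.wrfnatf00.grib2")

combined = {"version": 1, "refs": {".zgroup": json.dumps({"zarr_format": 2})}}
for i, msg in enumerate(messages):
    group = f"native_{i}"  # placeholder group name
    for key, value in msg["refs"].items():
        # .zgroup/.zarray/.zattrs and the chunk keys all move under the group
        combined["refs"][f"{group}/{key}"] = value

with open("hrrr-hybrid.json", "w") as f:
    json.dump(combined, f)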

@martindurant (Member)

it should still be using the Zarr v2 spec

Yes - it runs under zarr3, but produces v2 metadata, which will continue to be supported.
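One quick way to sanity-check which spec a reference file holds (assuming the usual {"version": 1, "refs": {...}} layout): v2 metadata appears as .zgroup/.zarray/.zattrs keys, while v3 metadata would live in zarr.json keys.

import json

# Count v2-style vs v3-style metadata keys in the reference file.
with open("hrrr-hybrid.json") as f:
    refs = json.load(f)["refs"]

print("v2 metadata keys:", sum(k.rsplit("/", 1)[-1].startswith(".z") for k in refs))
print("v3 zarr.json keys:", sum(k.endswith("zarr.json") for k in refs))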

@keltonhalbert (Contributor, Author)

Well, in that case this reference file should theoretically still work. I'm not entirely sure where to start other than digging into the xarray backend and working from there. No errors or warnings show up in the console when passing the group option; it just returns an empty dataset. I'll try to do some deeper digging and figure out what changed.

@martindurant (Member)

Writing it out longhand and choosing one particular group:

import fsspec
import xarray as xr

fs = fsspec.filesystem("asyncwrapper", fs=fsspec.filesystem("file"), asynchronous=True)
ds = xr.open_dataset("reference://native", engine="zarr", backend_kwargs={"storage_options": {"fo": "/Users/mdurant/Downloads/hrrr-hybrid.json", "fs": fs}, "consolidated": False})
ds

gives

<xarray.Dataset> Size: 5GB
Dimensions:     (hybrid: 50, y: 1059, x: 1799)
Coordinates:
  * hybrid      (hybrid) float64 400B 1.0 2.0 3.0 4.0 ... 47.0 48.0 49.0 50.0
    latitude    (y, x) float64 15MB ...
    longitude   (y, x) float64 15MB ...
    valid_time  (hybrid) datetime64[ns] 400B ...
Dimensions without coordinates: y, x
Data variables:
    pres        (hybrid, y, x) float64 762MB ...
    gh          (hybrid, y, x) float64 762MB ...
    t           (hybrid, y, x) float64 762MB ...
    q           (hybrid, y, x) float64 762MB ...
    u           (hybrid, y, x) float64 762MB ...
    v           (hybrid, y, x) float64 762MB ...
    w           (hybrid, y, x) float64 762MB ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             kwbc
    GRIB_centreDescription:  US National Weather Service - NCEP
    GRIB_subCentre:          0
    institution:             US National Weather Service - NCEP

To actually get data out, you also need remote_options (asynchronous=True) and remote_protocol ("https").
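For instance, extending the call above with those two options (a sketch under those assumptions: the reference JSON is local, the chunk data lives on HTTPS, and the installed fsspec registers the "asyncwrapper" protocol; the path and the variable chosen are placeholders):

import fsspec
import xarray as xr

# Local JSON references, chunk data fetched over HTTPS.
fs = fsspec.filesystem("asyncwrapper", fs=fsspec.filesystem("file"), asynchronous=True)
ds = xr.open_dataset(
    "reference://native",
    engine="zarr",
    backend_kwargs={
        "storage_options": {
            "fo": "hrrr-hybrid.json",
            "fs": fs,
            "remote_protocol": "https",
            "remote_options": {"asynchronous": True},
        },
        "consolidated": False,
    },
)
ds["t"].isel(hybrid=0).load()  # actually reading a chunk exercises the remote references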

@nishadhka

Just checking in with a similar issue reading JSON and parquet reference files under Zarr v3; I hit the following issue with fsspec:

>>> import fsspec
>>> fsspec.__version__
'2025.3.2'
>>> import xarray as xr
>>> import dask
>>> import zarr
>>> print("xarray:  ", xr.__version__)
xarray:   2025.4.0
>>> print("dask:    ", dask.__version__)
dask:     2025.4.1
>>> print("zarr:    ", zarr.__version__)
zarr:     3.0.7

>>> fs = fsspec.filesystem("asyncwrapper", fs=fsspec.filesystem("file"), asynchronous=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/runner/workspace/.pythonlibs/lib/python3.11/site-packages/fsspec/registry.py", line 309, in filesystem
    cls = get_filesystem_class(protocol)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/runner/workspace/.pythonlibs/lib/python3.11/site-packages/fsspec/registry.py", line 246, in get_filesystem_class
    raise ValueError(f"Protocol not known: {protocol}")
ValueError: Protocol not known: asyncwrapper
>>> 

However, running the following is able to open the JSON:

import xarray as xr
import fsspec
import importlib

# Try to locate the AsyncFileSystemWrapper class
try:
    # Try different potential import paths
    for path in [
        "fsspec.implementations.asyn",
        "fsspec.implementations.asyn_wrapper",
        "fsspec.asyn",
        "fsspec.asyn_wrapper"
    ]:
        try:
            module = importlib.import_module(path)
            if hasattr(module, "AsyncFileSystemWrapper"):
                AsyncWrapper = module.AsyncFileSystemWrapper
                break
        except ImportError:
            continue
    else:
        # If we couldn't find it, use AsyncFileSystem as fallback
        from fsspec.asyn import AsyncFileSystem as AsyncWrapper
except ImportError:
    # Last resort fallback
    from fsspec.asyn import AsyncFileSystem as AsyncWrapper

# Create the wrapped filesystem
file_fs = fsspec.filesystem("file")
fs = AsyncWrapper(fs=file_fs, asynchronous=True)

# Use it with xarray
ds = xr.open_dataset(
    "reference://native", 
    engine='zarr', 
    backend_kwargs={
        "storage_options": {
            "fo": "test_references/hrrr-hybrid.json", 
            "fs": fs,
            "consolidated": False
        }
    }
)

which ends up with the following warning, though ds does open:

<stdin>:1: RuntimeWarning: Failed to open Zarr store with consolidated metadata, but successfully read with non-consolidated metadata. This is typically much slower for opening a dataset. To silence this warning, consider:
1. Consolidating metadata in this existing store with zarr.consolidate_metadata().
2. Explicitly setting consolidated=False, to avoid trying to read consolidate metadata, or
3. Explicitly setting consolidated=True, to raise an error in this case instead of falling back to try reading non-consolidated metadata.
>>> ds
<xarray.Dataset> Size: 5GB
Dimensions:     (hybrid: 50, y: 1059, x: 1799)
Coordinates:
  * hybrid      (hybrid) float64 400B 1.0 2.0 3.0 4.0 ... 47.0 48.0 49.0 50.0
    latitude    (y, x) float64 15MB ...
    longitude   (y, x) float64 15MB ...
    valid_time  (hybrid) datetime64[ns] 400B ...
Dimensions without coordinates: y, x
Data variables:
    pres        (hybrid, y, x) float64 762MB ...
    gh          (hybrid, y, x) float64 762MB ...
    t           (hybrid, y, x) float64 762MB ...
    q           (hybrid, y, x) float64 762MB ...
    u           (hybrid, y, x) float64 762MB ...
    v           (hybrid, y, x) float64 762MB ...
    w           (hybrid, y, x) float64 762MB ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             kwbc
    GRIB_centreDescription:  US National Weather Service - NCEP
    GRIB_subCentre:          0
    institution:             US National Weather Service - NCEP

However, the following opens without any warning:

import xarray as xr
import fsspec
from fsspec.implementations.reference import ReferenceFileSystem

# Create a regular local filesystem with asynchronous=True
fs = fsspec.filesystem("file", asynchronous=True)

# Use it with xarray
ds = xr.open_dataset(
    "reference://native", 
    engine='zarr', 
    backend_kwargs={
        "storage_options": {
            "fo": "test_references/hrrr-hybrid.json", 
            "fs": fs, 
            "remote_protocol": "file",  # Specify the protocol for remote files
            "remote_options": {"asynchronous": True}  # Ensure this matches fs
        }, 
        "consolidated": False
    }
)
>>> ds
<xarray.Dataset> Size: 5GB
Dimensions:     (hybrid: 50, y: 1059, x: 1799)
Coordinates:
  * hybrid      (hybrid) float64 400B 1.0 2.0 3.0 4.0 ... 47.0 48.0 49.0 50.0
    latitude    (y, x) float64 15MB ...
    longitude   (y, x) float64 15MB ...
    valid_time  (hybrid) datetime64[ns] 400B ...
Dimensions without coordinates: y, x
Data variables:
    pres        (hybrid, y, x) float64 762MB ...
    gh          (hybrid, y, x) float64 762MB ...
    t           (hybrid, y, x) float64 762MB ...
    q           (hybrid, y, x) float64 762MB ...
    u           (hybrid, y, x) float64 762MB ...
    v           (hybrid, y, x) float64 762MB ...
    w           (hybrid, y, x) float64 762MB ...
Attributes:
    GRIB_edition:            2
    GRIB_centre:             kwbc
    GRIB_centreDescription:  US National Weather Service - NCEP
    GRIB_subCentre:          0
    institution:             US National Weather Service - NCEP

@martindurant (Member)

ValueError: Protocol not known: asyncwrapper

I think it's not yet released. It was previously called "async_wrapper", but it turns out URL protocol strings should not contain a "_".
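Until that release, one way around the registry error (a sketch, assuming your installed fsspec already ships the wrapper class even though the protocol name isn't registered) is to import the class directly instead of going through the protocol string:

import fsspec
from fsspec.implementations.asyn_wrapper import AsyncFileSystemWrapper

# Wrap the synchronous local filesystem so it can be used where an async
# filesystem is required, mirroring fsspec.filesystem("asyncwrapper", ...).
fs = AsyncFileSystemWrapper(fs=fsspec.filesystem("file"), asynchronous=True)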

"consolidated": False

Yes, this is annoying. kerchunk effectively does a different version of consolidation. It could provide a .zmetadata I suppose.
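For anyone who wants to avoid the consolidated-metadata fallback today, a rough workaround sketch (not something kerchunk does for you; it assumes the metadata values in "refs" are plain JSON strings, as kerchunk writes for v2): collect the existing .zgroup/.zarray/.zattrs entries into a consolidated .zmetadata key.

import json

# Build a Zarr v2 ".zmetadata" document from the metadata already present
# in the reference set and write out a new reference file containing it.
with open("hrrr-hybrid.json") as f:
    refs = json.load(f)

meta = {
    key: json.loads(val)
    for key, val in refs["refs"].items()
    if key.rsplit("/", 1)[-1] in (".zgroup", ".zarray", ".zattrs")
    and isinstance(val, str)
}
refs["refs"][".zmetadata"] = json.dumps(
    {"metadata": meta, "zarr_consolidated_format": 1}
)

with open("hrrr-hybrid-consolidated.json", "w") as f:
    json.dump(refs, f)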
