10000 use `orjson` to store metainfo package by lukaspie · Pull Request #613 · FAIRmat-NFDI/pynxtools · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

use orjson to store metainfo package #613

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 16 commits into
base: master
Choose a base branch
from
Open

use orjson to store metainfo package #613

wants to merge 16 commits into from

Conversation

lukaspie
Copy link
Collaborator
@lukaspie lukaspie commented Apr 4, 2025

No description provided.

@lukaspie lukaspie changed the title use orjson to store metainfo package use orjson to store metainfo package Apr 4, 2025
@lukaspie
Copy link
Collaborator Author
lukaspie commented Apr 7, 2025

When loading the pynxtools metainfo package from the stored file src/pynxtools/nomad/nxs_metainfo_package.json (which is autogenerated from running the schema without re-loading), it looks like the metainfo is correctly instantiated.

However, in parsing, there is an error along the lines of There is no subsection to hold a pynxtools.nomad.schema.<section-name>:Section in pynxtools.nomad.schema.<section-parent-name>:Section. For the specific debugging I am using (see launch.json, I get the error There is no subsection to hold a pynxtools.nomad.schema.Arpes__ENTRY:Section in pynxtools.nomad.schema.Arpes:Section. This is the first node in the NeXus files that's parsed. Interestingly, additional attributes that are in the NeXus file (in the NeXus base class NXroot) are properly parsed, so something is working.

For the error above, it basically comes from m_create function call in src/pynxtools/nomad/parser.py, specifically from this line. Interestingly, in debug mode, self.m_def.all_sub_sections_by_section does contain the value of section_def (which is pynxtools.nomad.schema.Arpes__ENTRY:Section) as a key, see the screenshot.

So basically, my main questions are:

  1. Why does the get call not resolve even though the section is in the all_sub_sections_by_section?
  2. What's different about instantiating the schema from scratch vs. loading the package and initializing the metainfo afterwards? If I do the former, everything looks exactly the same in debug mode, but the get call keeps properly resolved.

image

@lauri-codes @blueraft @DanielYang59 any suggestions why this may be the case? cc @sanbrock

@lukaspie
Copy link
Collaborator Author
lukaspie commented Apr 8, 2025

Problem is that these two objects are not exactly the same. In debug mode,

id(list(self.m_def.all_sub_sections_by_section.keys())[-1])
>> 140248739911328
id(section_def)
>> 140248775424816

On the other hand, if we generate the schema from scratch, they are indeed the same object:

id(list(self.m_def.all_sub_sections_by_section.keys())[-1])
>> 140336944267776
id(section_def)
>> 140336944267776

So it seems that somewhere in the serialization and reloading of the schema the memory address changes, but I don't understand this well enough.

We reproduced the same error when using a small test json file that only contains the section definitions needed for reproducing the error, see attached. In order to check this schema, you need to replace the name in the load_nexus_schema function in the schema.py file.

test.json

@blueraft
Copy link
blueraft commented Apr 8, 2025

I can also reproduce this! cc: @TLCFEM since this is something related to the metainfo.

@TLCFEM
Copy link
Contributor
TLCFEM commented Apr 8, 2025

Try set cached=False in this line: https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/blame/develop/nomad/metainfo/metainfo.py?ref_type=heads#L4394
Report back the result to proceed further.

@lukaspie
Copy link
Collaborator Author
lukaspie commented Apr 9, 2025

Try set cached=False in this line: gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/blame/develop/nomad/metainfo/metainfo.py?ref_type=heads#L4394 Report back the result to proceed further.

I still get the same issue, the ids do not match. I also tried cached=False for all_sub_sections since that is called in all_sub_sections_by_section, but the result was the same.

@TLCFEM
Copy link
Contributor
TLCFEM commented Apr 9, 2025

@lukaspie Please apply and check if everything works.
https://gitlab.mpcdf.mpg.de/nomad-lab/nomad-FAIR/-/merge_requests/2430

@lukaspie lukaspie marked this pull request as ready for review April 9, 2025 13:37
@TLCFEM
Copy link
Contributor
TLCFEM commented Apr 9, 2025

Long story short, m_from_dict utilise the SectionReference to resolve the references section, this yields a MProxy that mimics the behaviour of a Section.
It contains everything the target Section contains but they are different objects in memory.

@blueraft
Copy link

With this PR, the next pynxtools release would be incompatible with earlier versions of nomad right? Since pynxtools doesn't declare nomad as a direct dependency (not sure why?), this might result in some broken installations if they're not using the latest nomad version.

@lukaspie
Copy link
Collaborator Author

With this PR, the next pynxtools release would be incompatible with earlier versions of nomad right? Since pynxtools doesn't declare nomad as a direct dependency (not sure why?), this might result in some broken installations if they're not using the latest nomad version.

I think the original idea was that pynxtools can be used standalone (for data conversion), so it was meant to be a lightweight install outside of NOMAD. I think after pluginization, we should have probably added nomad as explicit dependency though.

In any case, we can probably only publish after the next nomad release anyway, right?

@rettigl
Copy link
Collaborator
rettigl commented Apr 10, 2025

With this PR, the next pynxtools release would be incompatible with earlier versions of nomad right? Since pynxtools doesn't declare nomad as a direct dependency (not sure why?), this might result in some broken installations if they're not using the latest nomad version.

I think the original idea was that pynxtools can be used standalone (for data conversion), so it was meant to be a lightweight install outside of NOMAD. I think after pluginization, we should have probably added nomad as explicit dependency though.

In any case, we can probably only publish after the next nomad release anyway, right?

I would argue we still should keep this independence. There should definitely be a standalone installation without nomad possible. We could create an optional dependency group, e.g. [nomad] that defines the required nomad version, though.

@lukaspie
Copy link
Collaborator Author

With this PR, the next pynxtools release would be incompatible with earlier versions of nomad right? Since pynxtools doesn't declare nomad as a direct dependency (not sure why?), this might result in some broken installations if they're not using the latest nomad version.

I think the original idea was that pynxtools can be used standalone (for data conversion), so it was meant to be a lightweight install outside of NOMAD. I think after pluginization, we should have probably added nomad as explicit dependency though.
In any case, we can probably only publish after the next nomad release anyway, right?

I would argue we still should keep this independence. There should definitely be a standalone installation without nomad possible. We could create an optional dependency group, e.g. [nomad] that defines the required nomad version, though.

So, we could have a [nomad] extra in pynxtools pyproject.toml and then in nomad-distro, we would use pynxtools[nomad]==<some-version> to ensure compatibility?

If I remember correctly, there were problems when using optional dependencies (like we had before with [convert]) and then running using uv pip compile like here, but maybe this is different.

@blueraft
Copy link

With this PR, the next pynxtools release would be incompatible with earlier versions of nomad right? Since pynxtools doesn't declare nomad as a direct dependency (not sure why?), this might result in some broken installations if they're not using the latest nomad version.

I think the original idea was that pynxtools can be used standalone (for data conversion), so it was meant to be a lightweight install outside of NOMAD. I think after pluginization, we should have probably added nomad as explicit dependency though.
In any case, we can probably only publish after the next nomad release anyway, right?

I would argue we still should keep this independence. There should definitely be a standalone installation without nomad possible. We could create an optional dependency group, e.g. [nomad] that defines the required nomad version, though.

So, we could have a [nomad] extra in pynxtools pyproject.toml and then in nomad-distro, we would use pynxtools[nomad]==<some-version> to ensure compatibility?

If I remember correctly, there were problems when using optional dependencies (like we had before with [convert]) and then running using uv pip compile like here, but maybe this is different.

In pynxtools[nomad], nomad-lab >= 1.3.16 should work though, having a fixed version there would instead complicate things.

@rettigl
Copy link
Collaborator
rettigl commented Apr 10, 2025

In pynxtools[nomad], nomad-lab >= 1.3.16 should work though, having a fixed version there would instead complicate things.

That's more or less what I mean, yes. Not sure about the uv pip compile, though. I guess we need to test it.

@lukaspie
Copy link
Collaborator Author

With this PR, the next pynxtools release would be incompatible with earlier versions of nomad right? Since pynxtools doesn't declare nomad as a direct dependency (not sure why?), this might result in some broken installations if they're not using the latest nomad version.

I think the original idea was that pynxtools can be used standalone (for data conversion), so it was meant to be a lightweight install outside of NOMAD. I think after pluginization, we should have probably added nomad as explicit dependency though.
In any case, we can probably only publish after the next nomad release anyway, right?

I would argue we still should keep this independence. There should definitely be a standalone installation without nomad possible. We could create an optional dependency group, e.g. [nomad] that defines the required nomad version, though.

So, we could have a [nomad] extra in pynxtools pyproject.toml and then in nomad-distro, we would use pynxtools[nomad]==<some-version> to ensure compatibility?
If I remember correctly, there were problems when using optional dependencies (like we had before with [convert]) and then running using uv pip compile like here, but maybe this is different.

In pynxtools[nomad], nomad-lab >= 1.3.16 should work though, having a fixed version there would instead complicate things.

Yeah sorry, this is what I meant, just a lower bound.

@lukaspie lukaspie force-pushed the metainfo-pkg branch 2 times, most recently from 0ae4739 to ca15b4f Compare April 10, 2025 13:05
@lukaspie
Copy link
Collaborator Author
lukaspie commented Apr 11, 2025

@rettigl I have implemented it now that the json file is created before the uv build step is run, so it gets shipped with the package to PyPI and is available after install. For editable installs, on the first run, we generate the package once and then afterwards, we can just reload it. The filename also contains the NeXus version, so if you change the definition, the schema package is automatically recreated. Are you happy with that? We also have nomad in the optional-dependencies now.

@rettigl
Copy link
Collaborator
rettigl commented Apr 11, 2025

@rettigl I have implemented it now that the json file is created before the uv build step is run, so it gets shipped with the package to PyPI and is available after install. For editable installs, on the first run, we generate the package once and then afterwards, we can just reload it. The filename also contains the NeXus version, so if you change the definition, the schema package is automatically recreated. Are you happy with that? We also have nomad in the optional-dependencies now.

This is totally fine. I only think it would be good to avoid doing every time you start up nomad, which already takes quite long for building the nexus metainfo.

echo "ERROR: nxs_metainfo_package_*.json not found in installed package."
exit 1
else
echo "Found $FILE_COUNT JSON file(s) in package:"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we test if the found package(s) correspond(s) to the actual definitions submodule?

# Updates the definitions and generates the metainfo package.
# Only works if you have pynxtools installed in editable mode.

update_nexus_version() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot we either put the common functions to a common file being sourced, or make a generic script which behaves differently based on a parameter (update vs. reset)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed to a single script with a parameter

figure_host = archive.data
# apend the figure to the figures list
figure_host.figures = [PlotlyFigure(figure=fig.to_plotly_json())]
for section in sections:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we gain additional time, if these attributes are also saved/loaded?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thy are saved/loaded with this approach.

@lukaspie lukaspie linked an issue Apr 23, 2025 that may be closed by this pull request
@lukaspie lukaspie force-pushed the metainfo-pkg branch 3 times, most recently from 118a5ec to 6286ec1 Compare June 10, 2025 11:09
@lukaspie lukaspie force-pushed the metainfo-pkg branch 2 times, most recently from 9bb456e to eb4aea2 Compare June 30, 2025 12:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Lazy load of pynxtools schema in NOMAD
5 participants
0