Hash hfive by mkuehbach · Pull Request #116 · FAIRmat-NFDI/pynxtools-em · GitHub

Hash hfive #116


Open · wants to merge 3 commits into emapm_database_reprocessing
Conversation

@mkuehbach (Collaborator) commented Apr 11, 2025

Preparation tasks for the macro-issue "adding proper and meaty unit tests to pynxtools-em":

  • Code to generate reference reports from HDF5 files that can be used in unit tests
  • The possibility for a blacklist to ignore nodes in HDF5 that store e.g. timestamps and are
    therefore expected to have different binary content on every run
  • Reporting, for each dataset (dst) and attribute (attrs), the datatype and the SHA256 of the
    content. This will substantially improve the chances of spotting cases where different library
    versions used by pynxtools-em yield different numerical results even though the content
    otherwise, i.e. datatype- and HDF5-template-path-wise, looks right. A checksum that evaluates
    the actual bits will expose such issues (see the sketch after this list).
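As a rough illustration of the approach, here is a minimal, hypothetical sketch of such a report generator, assuming h5py and numpy. The names (fingerprint_h5, sha256_of, attr_fingerprint) and the blacklist handling are illustrative, not the PR's actual implementation:

```python
import hashlib

import h5py
import numpy as np


def sha256_of(value) -> str:
    """SHA256 over the binary representation of a dataset or attribute value."""
    if isinstance(value, (np.ndarray, np.generic)):
        return hashlib.sha256(value.tobytes()).hexdigest()
    if isinstance(value, bytes):
        return hashlib.sha256(value).hexdigest()
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def attr_fingerprint(value) -> str:
    """Fingerprint string for an attribute: is_a_attr__<dtype>__<sha256>."""
    dtype = value.dtype.name if isinstance(value, (np.ndarray, np.generic)) else "str"
    return f"is_a_attr__{dtype}__{sha256_of(value)}"


def fingerprint_h5(file_name: str, blacklist: set | None = None) -> dict:
    """Map each HDF5 path to a lean fingerprint, skipping blacklisted paths."""
    blacklist = blacklist or set()
    report = {}

    def add_attrs(name, obj):
        for attr_name, attr_value in obj.attrs.items():
            attr_path = f"{name}/@{attr_name}" if name else f"@{attr_name}"
            if attr_path not in blacklist:
                report[attr_path] = attr_fingerprint(attr_value)

    def visit(name, obj):
        if name in blacklist:
            return
        if isinstance(obj, h5py.Group):
            report[name] = "is_a_grp"
        elif isinstance(obj, h5py.Dataset):
            arr = obj[()]
            dtype = arr.dtype.name if isinstance(arr, np.ndarray) else "str"
            report[name] = f"is_a_dst__{dtype}__{sha256_of(arr)}"
        add_attrs(name, obj)

    with h5py.File(file_name, "r") as h5f:
        add_attrs("", h5f)  # root attributes, e.g. @default
        h5f.visititems(visit)
    return report
```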

In a nutshell, this PR prepares the ground for asserting the content of the filtered instance data of two HDF5 files against each other, like we already do in unit tests for XML and YAML files in e.g. nyaml.

There are more tasks related to unit tests for pynxtools-em, but these will be handled via separate issues:

  • Get DOIs and checksums for the example files currently used by each parser, and eventually
    replace those examples that we cannot share. The plan is to upload the test data to
    e.g. Zenodo and have CI/CD download the test data from Zenodo in a loop for each test
    (parameterized).
  • Implement these tests: essentially, download the example, run the parser, generate the yaml
    artifact, and compare it with the reference (see the sketch after this list).
  • A script to assist developers with updating all yaml artifacts at once, if needed, prior to
    committing.
  • Check for possible downsides, e.g. too frequent warnings if random bit flips on the CI/CD
    server alter payload data despite correct computation, so that the hash turns out different.
    For now we assume in all our tests that these CI/CD servers run on modern, well-maintained
    hardware, so this scenario is rather hypothetical, but it should still be kept in mind.
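A hypothetical sketch of such a parameterized test follows. The Zenodo URL, checksum, and reference file name are placeholders, run_parser is a stand-in for the actual pynxtools-em entry point (not its real API), and fingerprint_h5 is the helper sketched above:

```python
import hashlib
import urllib.request
from pathlib import Path

import pytest
import yaml

# Placeholder test matrix: (download URL, expected sha256, reference artifact).
EXAMPLES = [
    ("https://zenodo.org/record/<id>/files/example1.h5", "<sha256>", "ref_example1.yaml"),
]


def run_parser(input_file: Path, output_dir: Path) -> Path:
    """Stand-in for the actual pynxtools-em parsing step (not its real API)."""
    raise NotImplementedError("replace with the real pynxtools-em invocation")


@pytest.mark.parametrize("url,checksum,reference", EXAMPLES)
def test_parser_against_reference(url, checksum, reference, tmp_path):
    payload = tmp_path / "example.h5"
    urllib.request.urlretrieve(url, payload)
    # verify the download itself before trusting its content
    assert hashlib.sha256(payload.read_bytes()).hexdigest() == checksum

    result_h5 = run_parser(payload, output_dir=tmp_path)

    # fingerprint_h5 as sketched above; the committed reference is plain yaml
    actual = fingerprint_h5(str(result_h5))  # noqa: F821 (see earlier sketch)
    expected = yaml.safe_load(Path(reference).read_text())
    assert actual == expected
```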

This is an example snippet from such a yaml artifact:

/@default: is_a_attr__str__c17dd9010a5c6b0e5b2ad5a845762d8b206e6166a4e63d32deca8c5664fdfcac
entry1: is_a_grp
entry1/@NX_class: is_a_attr__str__672a7eb0757166d49b2ff8f693193bdf449174535b90c3c5cd784dfb99398c7a
entry1/@default: is_a_attr__str__fba586be3b6f140b30389654d548a660d3a746cf8344ab6f39248caf65e2da4d
entry1/autophase: is_a_grp
entry1/autophase/@NX_class: is_a_attr__str__b8186fc624c1e2c6dac07b7c0cb0050ca74a1bdbbabdaef43fa0cb6780c13c0e
entry1/autophase/result: is_a_grp
entry1/autophase/result/@NX_class: is_a_attr__str__2ad9a930766abd87e93c7e7a69ee7bf9abbc8d771084e01bec20fd90455b8723
entry1/autophase/result/@axes: is_a_attr__str__c3a37d314c9728fa2860de244d90a005e20b88aedc23d6e47c20843c9ed37f14
entry1/autophase/result/@axis_feature_identifier_indices: is_a_attr__uint64__e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
entry1/autophase/result/@signal: is_a_attr__str__eb07f39b11350a5505ab130cd65fde3a66dd3b4183660d3ae4f3ad50232d043b
entry1/autophase/result/axis_feature_identifier: is_a_dst__uint64__dac8f5086e88078c7df98b4711520317bab2eefce1f6f9d47590300d6eacc693
entry1/autophase/result/axis_feature_identifier/@long_name: is_a_attr__str__c46353d60986dccb0f8dd3dc831e5ab3a242cb7a739eba2ef8c4eb5189306fbe
entry1/autophase/result/axis_feature_importance: is_a_dst__float64__e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
...

…the binary payload of each group, field, and attribute is hashed, with the possibility to blacklist entries that e.g. contain timestamps and would thus change between runs. The idea is to create lean fingerprints of NeXus and HDF5 files that can be used in unit tests. This avoids storing the binary HDF5 files and solves the issue that HDF5 tools like h5diff yield verbose lists of differences from which, to the best of my knowledge, you cannot exclude certain entries. Because of this, whenever timestamps are involved as instance data, like for the NeXus concepts start_time and end_time, one cannot assert the reference against the unit test result, as these will always differ.

Using a hash of the binary payload also assures that the datatype representation factorizes into the checksum: the same number stored as int32 and as single precision will have different binary representations, except for a few edge cases. Next steps: check that the hashing function in gt_checksum really processes the entire bytes object and not just e.g. the first 4096 bytes.
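The concern about partial hashing can be checked with a minimal sketch (not gt_checksum's actual code): feed the whole buffer into SHA256 in fixed-size chunks and verify against a one-shot hash, so no update is accidentally limited to the first read of e.g. 4096 bytes:

```python
import hashlib


def sha256_full(payload: bytes, chunk_size: int = 4096) -> str:
    """Hash the entire bytes object chunk by chunk."""
    digest = hashlib.sha256()
    for offset in range(0, len(payload), chunk_size):
        digest.update(payload[offset : offset + chunk_size])
    return digest.hexdigest()


# Sanity check: chunked hashing must equal one-shot hashing of the same buffer.
data = bytes(range(256)) * 100  # 25600 bytes, spanning several chunks
assert sha256_full(data) == hashlib.sha256(data).hexdigest()
```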