Hash hfive by mkuehbach · Pull Request #116 · FAIRmat-NFDI/pynxtools-em · GitHub

Hash hfive #116


Open · wants to merge 3 commits into emapm_database_reprocessing
Conversation

@mkuehbach (Collaborator) commented Apr 11, 2025

Preparation tasks for the macro-issue "adding proper and meaty unit tests to pynxtools-em":

  • Code to generate reference reports from HDF5 files that can be used in unit tests
  • The possibility for a blacklist to ignore nodes in HDF5 that store e.g. timestamps and are
    therefore expected to have different binary content on every run
  • Reporting, for each dataset (dst) and attribute (attrs), the datatype and the SHA256 of the
    content. This will substantially improve the chances of spotting cases where different library
    versions used by pynxtools-em yield different numerical results even though the content
    otherwise, i.e. datatype- and HDF5-template-path-wise, looks right. A checksum that evaluates
    the actual bits will expose such issues (see the sketch after this list).
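As a rough illustration of the approach, here is a minimal, hypothetical sketch of such a report generator, assuming h5py and numpy. The names (fingerprint_h5, sha256_of, attr_fingerprint) and the blacklist handling are illustrative, not the PR's actual implementation:

```python
import hashlib

import h5py
import numpy as np


def sha256_of(value) -> str:
    """SHA256 over the binary representation of a dataset or attribute value."""
    if isinstance(value, (np.ndarray, np.generic)):
        return hashlib.sha256(value.tobytes()).hexdigest()
    if isinstance(value, bytes):
        return hashlib.sha256(value).hexdigest()
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def attr_fingerprint(value) -> str:
    """Fingerprint string for an attribute: is_a_attr__<dtype>__<sha256>."""
    dtype = value.dtype.name if isinstance(value, (np.ndarray, np.generic)) else "str"
    return f"is_a_attr__{dtype}__{sha256_of(value)}"


def fingerprint_h5(file_name: str, blacklist: set | None = None) -> dict:
    """Map each HDF5 path to a lean fingerprint, skipping blacklisted paths."""
    blacklist = blacklist or set()
    report = {}

    def add_attrs(name, obj):
        for attr_name, attr_value in obj.attrs.items():
            attr_path = f"{name}/@{attr_name}" if name else f"@{attr_name}"
            if attr_path not in blacklist:
                report[attr_path] = attr_fingerprint(attr_value)

    def visit(name, obj):
        if name in blacklist:
            return
        if isinstance(obj, h5py.Group):
            report[name] = "is_a_grp"
        elif isinstance(obj, h5py.Dataset):
            arr = obj[()]
            dtype = arr.dtype.name if isinstance(arr, np.ndarray) else "str"
            report[name] = f"is_a_dst__{dtype}__{sha256_of(arr)}"
        add_attrs(name, obj)

    with h5py.File(file_name, "r") as h5f:
        add_attrs("", h5f)  # root attributes, e.g. @default
        h5f.visititems(visit)
    return report
```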

In a nutshell, this PR prepares the ground for asserting the content of the filtered instance data of two HDF5 files against each other, like we already do in unit tests for XML and YAML files in e.g. nyaml.

There are more tasks related to unit tests for pynxtools-em, but these will be handled via separate issues:

  • Get DOIs and checksums for the example files currently used by each parser, and eventually
    replace those examples that we cannot share. The plan is to upload the test data to
    e.g. Zenodo and have CI/CD download the test data from Zenodo in a loop for each test
    (parameterized).
  • Implement these tests: essentially, download the example, run the parser, generate the yaml
    artifact, and compare it with the reference (see the sketch after this list).
  • A script to assist developers with updating all yaml artifacts at once, if needed, prior to
    committing.
  • Check for possible downsides, e.g. too frequent warnings if random bit flips on the CI/CD
    server alter payload data despite correct computation, so that the hash turns out different.
    For now we assume in all our tests that these CI/CD servers run on modern, well-maintained
    hardware, so this scenario is rather hypothetical, but it should still be kept in mind.
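A hypothetical sketch of such a parameterized test follows. The Zenodo URL, checksum, and reference file name are placeholders, run_parser is a stand-in for the actual pynxtools-em entry point (not its real API), and fingerprint_h5 is the helper sketched above:

```python
import hashlib
import urllib.request
from pathlib import Path

import pytest
import yaml

# Placeholder test matrix: (download URL, expected sha256, reference artifact).
EXAMPLES = [
    ("https://zenodo.org/record/<id>/files/example1.h5", "<sha256>", "ref_example1.yaml"),
]


def run_parser(input_file: Path, output_dir: Path) -> Path:
    """Stand-in for the actual pynxtools-em parsing step (not its real API)."""
    raise NotImplementedError("replace with the real pynxtools-em invocation")


@pytest.mark.parametrize("url,checksum,reference", EXAMPLES)
def test_parser_against_reference(url, checksum, reference, tmp_path):
    payload = tmp_path / "example.h5"
    urllib.request.urlretrieve(url, payload)
    # verify the download itself before trusting its content
    assert hashlib.sha256(payload.read_bytes()).hexdigest() == checksum

    result_h5 = run_parser(payload, output_dir=tmp_path)

    # fingerprint_h5 as sketched above; the committed reference is plain yaml
    actual = fingerprint_h5(str(result_h5))  # noqa: F821 (see earlier sketch)
    expected = yaml.safe_load(Path(reference).read_text())
    assert actual == expected
```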

This is an example snippet from such a yaml artifact:

/@default: is_a_attr__str__c17dd9010a5c6b0e5b2ad5a845762d8b206e6166a4e63d32deca8c5664fdfcac
entry1: is_a_grp
entry1/@NX_class: is_a_attr__str__672a7eb0757166d49b2ff8f693193bdf449174535b90c3c5cd784dfb99398c7a
entry1/@default: is_a_attr__str__fba586be3b6f140b30389654d548a660d3a746cf8344ab6f39248caf65e2da4d
entry1/autophase: is_a_grp
entry1/autophase/@NX_class: is_a_attr__str__b8186fc624c1e2c6dac07b7c0cb0050ca74a1bdbbabdaef43fa0cb6780c13c0e
entry1/autophase/result: is_a_grp
entry1/autophase/result/@NX_class: is_a_attr__str__2ad9a930766abd87e93c7e7a69ee7bf9abbc8d771084e01bec20fd90455b8723
entry1/autophase/result/@axes: is_a_attr__str__c3a37d314c9728fa2860de244d90a005e20b88aedc23d6e47c20843c9ed37f14
entry1/autophase/result/@axis_feature_identifier_indices: is_a_attr__uint64__e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
entry1/autophase/result/@signal: is_a_attr__str__eb07f39b11350a5505ab130cd65fde3a66dd3b4183660d3ae4f3ad50232d043b
entry1/autophase/result/axis_feature_identifier: is_a_dst__uint64__dac8f5086e88078c7df98b4711520317bab2eefce1f6f9d47590300d6eacc693
entry1/autophase/result/axis_feature_identifier/@long_name: is_a_attr__str__c46353d60986dccb0f8dd3dc831e5ab3a242cb7a739eba2ef8c4eb5189306fbe
entry1/autophase/result/axis_feature_importance: is_a_dst__float64__e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
...

…the binary payload of each group, field, and attribute is hashed, with the possibility to blacklist entries that e.g. contain timestamps and would thus change between runs. The idea is to create lean fingerprints of NeXus and HDF5 files that can be used in unit tests. This avoids storing the binary HDF5 files and solves the issue that HDF5 tools like h5diff yield verbose lists of differences from which, to the best of my knowledge, you cannot exclude certain entries. Because of this, whenever timestamps are involved as instance data, like for the NeXus concepts start_time and end_time, one cannot assert the reference against the unit test result, as these will always differ.

Using a hash of the binary payload also assures that the datatype representation factorizes into the checksum: the same number stored as int32 and as single precision will have different binary representations, except for a few edge cases. Next steps: check that the hashing function in gt_checksum really processes the entire bytes object and not just e.g. the first 4096 bytes.
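The concern about partial hashing can be checked with a minimal sketch (not gt_checksum's actual code): feed the whole buffer into SHA256 in fixed-size chunks and verify against a one-shot hash, so no update is accidentally limited to the first read of e.g. 4096 bytes:

```python
import hashlib


def sha256_full(payload: bytes, chunk_size: int = 4096) -> str:
    """Hash the entire bytes object chunk by chunk."""
    digest = hashlib.sha256()
    for offset in range(0, len(payload), chunk_size):
        digest.update(payload[offset : offset + chunk_size])
    return digest.hexdigest()


# Sanity check: chunked hashing must equal one-shot hashing of the same buffer.
data = bytes(range(256)) * 100  # 25600 bytes, spanning several chunks
assert sha256_full(data) == hashlib.sha256(data).hexdigest()
```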