h5RDMtoolbox - A Python Toolbox for FAIR Data Management around HDF5

This is a Preprint and has not been peer reviewed. This is version 4 of this Preprint.

Authors

Matthias Probst, Balazs Pritz

Abstract

Sustainable data management is fundamental to efficient and successful scientific research. The FAIR principles (Findable, Accessible, Interoperable and Reusable) have proven to be successful guidelines enabling comprehensible analysis, discovery and re-use. Although the topic has recently gained increasing awareness in both academia and industry, the engineering sciences in particular are lagging behind in managing the valuable asset of data. While large collaborations and research facilities have already implemented metadata strategies, smaller research groups and institutes often lack a common strategy due to heterogeneous and rapidly changing environments as well as missing capacity or expertise. This paper presents an open-source Python package, called h5RDMtoolbox, which helps to quickly implement and maintain FAIR research data management along the entire data lifecycle using HDF5 as the core file format. One of the key features of the toolbox is the flexible, high-level implementation of metadata standards, adaptable to the changing requirements of projects, collaborations and environments, such as experimental or computational setups. Implementing existing schemas such as EngMeta or the cf-conventions is possible and an intended use case. Other benefits of the toolbox include a simplified interface to the data and database solutions to query metadata stored in HDF5 files.

Comments

Comment #123 Matthias Probst @ 2024-07-06 16:17

We would like to thank the reviewers for their careful reading and very helpful feedback. The new software version is now v1.4.0 and includes the suggested improvements. The documentation has also been updated and is now free of broken links and rendering problems. Both the latest and the v1.4.0 version of the documentation are available; the latter documents the state described in the manuscript (v1.4.0 documentation in ref [11]).

Problems with listings, captions and unclear wording have been fixed. The conceptual changes based on the reviewers' comments are updated in the manuscript and in the code. In short, they concern interaction with repositories and working with databases:
1. The abstract class for repositories has been improved. In particular, the concrete implementation of the Zenodo subclasses has been simplified (only one is needed instead of two; switching between sandbox and production is done via a parameter). Uploading data is now simplified by using only one clear method (upload_file), which optionally takes a metamapper function. It creates an additional metadata file for any input file format. Only if an HDF5 file is uploaded and no custom function is provided will the default conversion function to a JSON-LD file be triggered (see the sketch after this list).
2. The abstract database interface is divided: the most basic interface (HDF5DBInterface) implements only the search methods, while ExtHDF5DBInterface is the interface for external databases that extend the query capabilities.
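A minimal sketch of the revised upload workflow described in point 1; the class name (ZenodoRecord), the sandbox parameter and the metamapper keyword are assumptions for illustration, and my_csv_mapper is a hypothetical user-defined function:

```
from h5rdmtoolbox.repository import zenodo

def my_csv_mapper(filename):
    """Hypothetical metamapper: write a metadata file for a CSV input
    and return its path."""
    meta_filename = filename + '.json'
    ...  # extract metadata from the CSV and write it to meta_filename
    return meta_filename

# One Zenodo class; switching between sandbox and production via a parameter:
repo = zenodo.ZenodoRecord(sandbox=True)

# HDF5 file, no custom function: the default conversion to JSON-LD is triggered
repo.upload_file('experiment.hdf')

# Any other format: the custom metamapper creates the additional metadata file
repo.upload_file('measurements.csv', metamapper=my_csv_mapper)
```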

We are grateful for all the feedback, which has improved the software considerably.

Comment #109 Kevin Logan @ 2024-05-29 16:32

I would like to thank the reviewers for their thorough work and detailed feedback.

Based on these latest reviews, I believe that the submission has been greatly improved in the latest version and I am confident that we will be able to publish it in ing.grid in due course.

I would like to encourage the authors to carefully consider the suggestions made in Invited Review Comment #88 and address the issues mentioned therein.

I look forward to a revised version of the software together with an update of the software descriptor.

Invited Review Comment #88 Dennis Gläser @ 2024-01-29 15:01

I would like to thank the authors for the effort put into refactoring the package, improving its documentation and fixing the version incompatibility issues. As said in the last review, I believe this package is a valuable contribution to the community, and the manuscript fits well into the Ing.Grid journal. I think the manuscript is well-written in general, but I have a few concerns about section 3 (see below). Both the manuscript and the code documentation still have a number of typos, broken links and typesetting issues (I will list some of them below). These could easily be fixed in the typesetting phase. However, in my opinion, some of these issues are critical for publication in Ing.Grid (e.g. point 2), so I suggest one more iteration in the review process.


Below the rather minor language and typesetting issues in my comment sections a), b) and c), I have written a rather lengthy review of general concerns I had during my interaction with the code, documentation and manuscript. I guess many of these are quite subjective and I don’t expect the authors to answer/address all of them. But I hope that they contain some good ideas that the authors may want to consider.



a) Typesetting issues:


1. Listing 3 is split by a page break and by a figure at the top of page 10. A page break is probably fine, but maybe the figure can be placed elsewhere to not interrupt the flow of reading too much?


2. In L.241, Listing 4 is referenced although I guess this should actually reference Listing 3. Listing 4 has the same description as listing 3, and it doesn’t fit to the code that is shown. I guess this is a copy-paste error, where both the caption and the label of the listing have not been modified?




b) minor language issues and typos in the manuscript:


3. Abstract (last sentence): “interface to THE repository and database solutions” -> As these have not been introduced at this point, I would leave out “THE” to avoid confusion.

4. L.188: “As shown in…, a Convention objects takes” -> Convention object takes (i.e. singular)?

5. Figure 5, in the UnitAttribute component:

    a) the leading quotes on “units” are misplaced 

    b) “Phyiscal unit” -> Physical unit

6. L.205: Closing bracket seems to be missing after the link to figure 4.

7. L.206-210: I believe these two consecutive sentences state the same thing, i.e. that for more info the documentation should be consulted.

8. L.269: “… provides two option of” -> “… provides two options of” (i.e. plural)?

9. L.277: “… whether to download to large HDF5 file …” -> … to download THE large HDF5 file … ?

10. L.322 neglectable -> negligible?

11. L.350: “While existing solutions exist to address” (“exist” is repeated) -> While solutions exist that address.. ?

12. L.364: “Future work should set tje focus” -> the focus?



c) (some of the) issues in the online documentation:


13. The landing page of the online documentation is full of markdown links in the form of [placeholder](link), which are not rendered correctly. The documentation pages seem to be written in .rst, which uses a different syntax for links, I believe. The documentation in general looks very nice, and I think quite some effort went into it, so I am surprised that this went unnoticed.


14. Some image links seem broken or wrong, for instance: https://h5rdmtoolbox.readthedocs.io/en/latest/userguide/repository/zenodo.html


15. I had mentioned this in my last review as well: there are large code cells with keyboard interrupt traceback output (https://h5rdmtoolbox.readthedocs.io/en/latest/userguide/database/mongo.html). I still think this should not be in the documentation.


16. As said above, I stumbled across a number of typos, most of them don’t affect understandability, though. However, the section on `FileDB` is very confusing (https://h5rdmtoolbox.readthedocs.io/en/latest/userguide/database/hdfDB.html). It says that “The database we are going to use is the `GroupDB`. It takes an opened HDF5 group as input”, but the code snippet uses `FileDB`.  `GroupDB` does not seem to exist in the code base (a search on GitHub did not yield any hit).



d) question on the difference between `Layout` and `Convention`:


17. The way I understand it, the conventions are used such that when writing e.g. a dataset to a file while forgetting to define a standard attribute, one runs into a runtime error? For the case that the computation of the data is very time-consuming (e.g. a heavy simulation), it would be undesirable if one runs into that error at the end of the work when trying to write the result to a file. So I guess this is not the intended use case? Figure 4 suggests that the conventions are only used when converting existing data into hdf5?


In any case, with a layout one seems to be able to define things like “datasets must have the attribute ‘units’”. Isn’t this also what one can do with a standard attribute in a convention? Can one create a layout from a convention for subsequent validation? To me, the distinction between the different things is not totally clear in the manuscript. The term “validator” appears in the discussion of conventions; it is also said that a “layout is a static validator” (what does static mean here?); And, as said, conventions and layouts seem to have overlapping use cases. 



e) comment on section 3.4


18. The caption of listing 4 seems wrong (see my comment in the typesetting issues). But besides this, it would be good if listing 4 was referenced in the text somewhere and described a little bit. For instance, in which form is the metadata passed to the constructor of the `zenodo.metadata.Metadata()` object? The listing just shows “…”. From looking into the code it seems to me that one passes named arguments (i.e. key-value pairs). Maybe one can show this in the listing, e.g. via comments like


```
repo.metadata = zenodo.metadata.Metadata(
    title="my-zenodo-dataset",
    # pass your metadata as named arguments as shown above for the field "title"
)
```


As a side note, from looking into the code it seems to me that there are mandatory fields that are checked via Pydantic. The mandatory fields probably depend on the actual repository (e.g. Zenodo)? The online docs show an example, but they don’t list which fields are mandatory or which ones are possible (for a specific repo)? This could be very helpful. The text in the manuscript could also include a short comment on this.
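To illustrate the point about Pydantic, a minimal sketch of how mandatory fields can be enforced; the field names (title, upload_type) are assumptions for illustration, not the toolbox’s actual model:

```
from pydantic import BaseModel, ValidationError

class Metadata(BaseModel):
    title: str              # mandatory: no default value
    upload_type: str        # mandatory: no default value
    description: str = ""   # optional: has a default

try:
    Metadata(title="my-zenodo-dataset")
except ValidationError as err:
    print(err)  # reports the missing mandatory field 'upload_type'
```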



f) comments on section 3.5:


19. The section starts by stating that the “toolbox implements two ways of sharing and re-using by means of databases”. The way I see it, the purpose of the database interface is to query/filter datasets of interest conveniently and (potentially) efficiently? How is the “sharing” aspect addressed by the database concept? I think the aspect of “sharing” is addressed with the “repository” sub-package, not with “database”. I would not even speak of reusing, but rather of exploring data? I think the reusing aspect is addressed with the core wrapper.


20. The caption of figure 9 is difficult to parse, and I think it may actually be wrong. It starts with “The metadata of the HDF5 files can be mapped to a MongoDB database and then filtered”. But then it says that “the other option” (I would also actually spell out `FileDB` here) “returns the data directly”. This sounds as if the two solutions had a different interface and promoted different usage? That is, “MongoDB” filters metadata but “FileDB” returns it directly? But then it says “for both solutions, the query functions and return values are the same, because the interfaces are inherited from the abstract database interface class”. From the code it seems to me that this latter statement is correct, that is, they both return the same lazy objects that automatically fetch the data upon request? To sum up, is the “and returns the data directly” actually true, and does it need to be stated?


Besides that, I don’t think that for Python the abstract base class enforces any guarantees on the return type. Type checkers will complain, yes, but I think the interpreter happily takes a derived class that returns some other object from the abstract method. Thus, I think the argument that “because of the inheritance from the abstract base class the return values are the same” is incorrect.
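The point that the interpreter does not enforce annotated return types can be demonstrated in a few lines:

```
from abc import ABC, abstractmethod

class DBInterface(ABC):
    @abstractmethod
    def find(self) -> list:  # annotated to return a list
        ...

class BadDB(DBInterface):
    def find(self) -> str:   # violates the annotation ...
        return "not a list"

BadDB().find()  # ... yet runs without complaint; only type checkers object
```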


21. L.300 until the end of the paragraph: this is again hard to understand. As in the first review, I still think “compatible with a dedicated system” is confusing here. Again, I suppose the authors want to say that the “sequential query” when using `FileDB`/`FilesDB` is less efficient than performing queries on a database? But the two sentences that follow are also hard to parse. Which inefficiency is outweighing? Do the authors want to say that the effort of setting up the database (in order to get faster queries) is not worth it when inspecting a single file or a few files, only?


Finally, the last sentence is also confusing: “in addition, this concept, as implemented in the toolbox, requires no prior operations on the data and only takes a minimum number of lines for the user.” Only after looking into the online examples I believe I understood what the authors mean. It seems that when using the MongoDB approach, one has to insert the datasets manually into the database, is that the reason why the authors state here that the FileDB approach “requires no prior operations on the data and only takes a minimum number of lines for the user”? The prior operations mean the manual insertion into the database? I would either remove this sentence or give the reader more context to understand it.


22. “the second approach extracts the file information and all metadata… and writes it into the MongoDB”. This sounds as if the library does that for the user, but from what I could see, one has to manually insert the datasets. I saw that `FilesDB` has a factory function `from_folder`. Couldn’t the package expose something similar for a MongoDB such that the user just gives it a folder and all metadata is collected automatically and inserted into the database? (For a more fine-grained setup, if needed, one could still do it manually).
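A sketch of the factory function suggested here, built directly on pymongo and h5py; mongodb_from_folder is a hypothetical helper, not part of the package:

```
import pathlib

import h5py
from pymongo.collection import Collection

def mongodb_from_folder(folder: str, collection: Collection) -> None:
    """Hypothetical helper: collect the metadata of all HDF5 files in a
    folder and insert it into a MongoDB collection."""
    for path in pathlib.Path(folder).glob("*.hdf"):
        with h5py.File(path, "r") as h5:
            def visit(name, obj):
                collection.insert_one({
                    "filename": str(path),
                    "name": name,
                    "attrs": {k: str(v) for k, v in obj.attrs.items()},
                })
            h5.visititems(visit)
```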



g) comments on section 4. 


23. L.329: “… limits its compatibility to Python versions.” Why “versions”? And which versions? As far as I see, the package is compatible with all currently maintained Python versions.



h) comments on the code:


24. While I think it is a great idea to have the possibility to upload either a single file, or a file with metadata alongside it, I am not sure if the interface names `upload_file` and `upload_hdf_file` are self-explanatory. As I understand it, `upload_file` is used, or at least can be used, for uploading an .hdf file (.hdf being the format of choice, anyway). Thus, I think adding `_hdf_` to the function name does not really suggest that this now creates and uploads metadata? Why is it not simply called `upload_file_with_metadata` or something like that?


25. The “database” sub-package exposes interfaces like “insert_dataset” and “insert_group”. From what I understand, the datasets and groups are not really inserted into the database, but only their metadata? In this sense, I am not sure about the naming of the functions and the classes. It seems to me that this is actually rather a “Metadatabase” that allows the retrieval of handles to the actual underlying data for convenience? But it does not allow insertion of actual field data?


This issue seems to also be reflected in the code: the `HDF5DBInterface` exposes 4 methods, two of which throw runtime exceptions in the case of `FileDB` and `FilesDB` the way I see it (i.e. the inserter functions). So I am not sure if this is the right abstraction, or maybe one layer of abstraction is missing. Since two of three implementations don’t implement the inserter functions, should they be part of the database interface? Maybe only the querying is the common part and the insertion functions are specific to the implementations? 


Finally, if users did not have to manually insert datasets/groups, but the `MongoDB` class had some sort of factory function (see other comment), then the two functions would maybe not even be necessary.
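A sketch of the layering suggested above: querying as the common abstraction, insertion as an extension for backends that support it (class and method names are illustrative):

```
from abc import ABC, abstractmethod

class QueryInterface(ABC):
    """Common part: every database implementation can be queried."""
    @abstractmethod
    def find(self, flt: dict): ...

    @abstractmethod
    def find_one(self, flt: dict): ...

class InsertableQueryInterface(QueryInterface):
    """Extension for backends (e.g. MongoDB) that also support insertion."""
    @abstractmethod
    def insert_dataset(self, dataset): ...

    @abstractmethod
    def insert_group(self, group): ...
```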


26. The file name for `NonInsertableDatabaseInterfase` is `nonsearchable.py` (https://github.com/matthiasprobst/h5RDMtoolbox/blob/main/h5rdmtoolbox/database/hdfdb/nonsearchable.py)


27. The docstring for `FilesDB` seems to be a copy from `FileDB` and I believe it is wrong (https://github.com/matthiasprobst/h5RDMtoolbox/blob/main/h5rdmtoolbox/database/hdfdb/filedb.py#L49)

Invited Review Comment #84 Sangeetha Shankar @ 2024-01-17 17:34

Outcome: Accept (minor changes are to be done)

 

Detailed review:

Thanks to the authors for addressing the comments and for updating the software, its descriptor and the documentation. The authors have summarized the existing solutions from different scientific disciplines as well as the current gaps and needs for a generic approach to manage HDF5 datasets in a FAIR-compliant way. I find the new structure of the Methodology / Concepts and Architecture section more suitable for a software descriptor, as it describes in detail, along with examples, the five sub-packages of h5rdmtoolbox as well as the stages of the data lifecycle where these sub-packages are useful. Wherever the toolbox provides multiple methods to accomplish a task, the authors have explained the pros and cons of each method. The limitations of the toolbox are also explicitly mentioned in chapter 4, along with comments on the performance of the toolbox.

In section 3.4, two approaches to upload the HDF5 file to a repository are explained: 1. to upload the file as it is (method: upload_file()); 2. to generate metadata from the HDF5 file using a metamapper function and upload the HDF5 file along with a JSON file containing the metadata (method: upload_hdf_file()). I understand from the text that the metamapper function generates information on the structure and the attributes of the HDF5 file. However, further details on the metamapper function “hdf2json” mentioned in listing 4 are not found in the documentation. I suggest adding to the documentation an explanation of this function with examples. It is also unclear whether the metadata provided by the user in line 284 and those generated in line 287 are stored together. Furthermore, I recommend rethinking/changing the method names “upload_file” and “upload_hdf_file”, as the names are not representative of what the functions do. As a user, I can imagine myself looking into the documentation often to find which of these two functions generates additional metadata from HDF5 files while uploading.

The documentation has been significantly improved since the previous review. It is very well-structured with plenty of information for the users to refer to when using the toolbox in their scientific research. Metadata of the software project is present in the form of a CodeMeta file.

Apart from these, I propose the following minor edits:

- Caption of figure 3 – I suppose, the word “convention” should be italicized as it is the name of a module.

- Line 205 – closing bracket is missing.

- Line 241 – From the explanation, I think Listing 3 demonstrates the workflow in figure 6, and not Listing 4 as mentioned in the text.

- The placement of figure 6 must be changed as it currently sits in the middle of the code in Listing 3.

- Listing 4 is not mentioned in the text.

- #17 in the references list seems to have only one author, but the author name occurs twice. It also seems to be a book; hence ISBN could be added.

- https://github.com/matthiasprobst/h5RDMtoolbox/blob/main/codemeta.json - one of the authors of the paper, Balazs Pritz, is not mentioned as an author or a contributor.

- ROR ID of the research organization could be added to the CodeMeta file.

Comment #82 Kevin Logan @ 2023-12-22 11:41

Dear Authors,
Thank you for your efforts towards improving your software package and the software descriptor, and your extensive response to the issues raised by the reviewers. I have invited all three reviewers to reassess your submitted software package and software descriptor in view of the changes and your response. A new decision on your manuscript will be made based on the new feedback of the reviewers.

Comment #80 Matthias Probst @ 2023-12-20 16:42

I forgot to comment on data availability: there is no clear source of data for this software. We ship most of the (example/test) data with the GitHub repository; some examples use data downloaded from Zenodo. My best guess is that I should provide the GitHub repository link.

Comment #79 Matthias Probst @ 2023-12-20 16:39

The new version of the manuscript uses a different structure to enhance the readability and understanding of the scope and capabilities of the toolbox:
1. A section “scope and related work” is added. It outlines existing solutions to manage HDF5 files and states the need for a holistic approach including all aspects of working with HDF5 in the context of the research data lifecycle.
2. Section 3 is now the main part of the manuscript. It explains the concept and architecture of the package, not by aligning with the lifecycle but with the sub-packages of the h5rdmtoolbox. This is a cleaner approach and results from some refactoring of the toolbox based on the helpful reviewer comments about the code. Moreover, the new structure avoids jumping back and forth, as was the case in version 1 of the manuscript, where the lifecycle was the basis for explaining the toolbox. However, the data lifecycle is still important, and the paragraphs on the sub-packages state their relevance in the cycle and their contribution to the FAIRness of HDF5 files.
3. The references list was updated as requested
4. The authorship was clarified by thanking the user registered in the GitHub repository and adding him to the codemeta file
5. Some of the explanations are outsourced to the documentation, as they might get too confusing in written text and need direct experience (possible with the provided Jupyter notebooks). However, minimal examples are provided as code examples, most of the time in combination with a figure.
6. Code listings now have captions. The issue with the overlapping line numbers seems to be a template issue.
7. A section about limitations, including the remark on possible performance issues, is added.
8. Figures were updated. The color issue was solved by using white font in figure 1 (unfortunately, after uploading the new version, I saw that only one text element took the color; I will update this together with a new version)
9. I am not sure about the remark stated by Nils Preuß concerning the license. Is there a conflict between the software license (MIT) and the publication? If yes, I would be glad to get advice on how to resolve it.
If we missed something, we are happy to update the manuscript and the documentation (the latter is of course constantly evolving; at the current state, we are focusing on improving the structure and readability rather than adding new features).

Comment #76 Matthias Probst @ 2023-12-18 20:12

Dear reviewers,

I would like to comment on the issues raised about the implementations first, before commenting on the issues raised about the manuscript.

As of today, a new version (v1.0.0) is released and available via pip. The documentation is also updated according to the changes listed below. We tried to be more concise and added more information about architectural details. Broken links were fixed. Language editing of the documentation is a work in progress.


The major changes as a consequence of the helpful and constructive feedback are:

• The architecture of the toolbox was updated: it now has five dedicated modules or sub-packages. Only the wrapper (around HDF5) uses the convention module (c.f. https://h5rdmtoolbox.readthedocs.io/en/latest/_images/h5tbx_modules.svg)

• The toolbox is now more maintainable and allows adding new database and repository interfaces beyond the ones implemented. This is achieved by a clear object-oriented approach, using abstract classes that define the “rules” of the interfaces.

• The issue with the Zenodo API is resolved; the code is now part of the toolbox and no longer of an external package.

• The repository module explicitly supports uploading to and downloading from repositories, promoting the aspect of sharing.

• A solution to assign persistent identifiers (IRIs) to HDF5 attributes is added. This improves interoperability.

• The formal description of the package is improved by adding the file codemeta.json.

• Extensive testing for all current Python versions starting at 3.8 and for all operating systems is included in the test pipeline.

• The documentation is updated according to the changes.


Detailed comments with references to the review comments:

• A codemeta.json file was added, so the package is now described formally (@#64: a version number was included in the setup.cfg file, as is common practice for Python packages, and is now also available in the codemeta file. Should I be missing another place to put a version, I am happy to fix that.)

• The package is now tested against Python 3.8 through 3.12 on macOS, Windows and Linux (see https://github.com/matthiasprobst/h5RDMtoolbox/actions)
This should solve the issues raised in comments #67 and #68 concerning system requirements and testing

• Issues with the documentation and broken links as mentioned in #67 (points 4 and 5) are fixed

• Concerning the Zenodo API in comments #67 and #64: Zenodo.org changed the API. The full dependency on the package zenodo_search was resolved by making the code part of the software, and the communication with the Zenodo API was updated. This led to two outcomes: 1. Files can be downloaded from Zenodo again and no longer need to be downloaded manually. 2. The module/sub-package “repository” has been added, which provides an abstract interface class. One such realization is the Zenodo interface. Hence, additional interfaces can be implemented by others without touching the rest of the code. Reference to documentation: https://h5rdmtoolbox.readthedocs.io/en/latest/repository/zenodo.html
The interface to Zenodo is explicitly part of the toolbox. It allows users to upload and download HDF5 files (see https://h5rdmtoolbox.readthedocs.io/en/latest/repository/zenodo.html).
I hope that this also answers point 6 in comment #67.
I will address the repository part and modular design in the manuscript, too.
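A minimal sketch of what such an abstract repository interface could look like; the method names are illustrative, see the linked documentation for the actual class:

```
from abc import ABC, abstractmethod

class RepositoryInterface(ABC):
    """Rules every repository realization (e.g. Zenodo) must follow."""

    @abstractmethod
    def upload_file(self, filename: str) -> None: ...

    @abstractmethod
    def download_files(self, target_dir: str = ".") -> list: ...

    @abstractmethod
    def get_doi(self) -> str:
        """A persistent identifier must be provided (cf. the sharing aspect)."""
```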

• Referring to point 8 in comment #67: I agree that the aspect of sharing was not backed up strongly enough. Through the above-described implementation of the module “repository”, it hopefully becomes a bit clearer. The explicit implementation of repository solutions by means of an abstract class, which enforces the provision of a DOI, should fulfill the aspect of “sharing”.

• I added the option to assign HDF attributes to IRIs (https://h5rdmtoolbox.readthedocs.io/en/latest/convention/ontologies.html). This was not possible before, and to my knowledge, no other HDF-related software supports it. With this, I try to further strengthen the FAIRness of HDF5 files, especially the aspect of interoperability. I hope to adequately answer the comment in #68 about the use of vocabularies here. I will add a part in the manuscript, too.

• Thanks to the helpful comment in #67 concerning the implementation of the databases: just like the repository module, the database module was refactored such that it is now very modular: new database interfaces can be easily added. Again, an abstract class sets the “rules” for future implementations and successful work with HDF5 files. Link to documentation: https://h5rdmtoolbox.readthedocs.io/en/latest/database/index.html

• A comment on performance, as mentioned in comment #68: the primary goal of the toolbox is to achieve richly described HDF5 files and to provide tools to interface with databases and repositories. This adds overhead and certainly does not make working with HDF5 faster, but it makes it FAIRer. The toolbox does not limit the underlying package h5py and therefore the capabilities of working with HDF5 through Python. Certainly, the I/O gets a bit slower due to the interface with xarray and the additional metadata validations. I was not planning on conducting speed tests. However, I will mention the scope but also the limitations in this respect in the new version of the manuscript. Would something like a speed test be desired?


The next comment will be on the new version of the manuscript and on the remaining remarks in the comments.

Comment #70 Matthias Probst @ 2023-11-22 07:59

To all reviewers, thank you very much for the constructive and helpful feedback. We will revise the manuscript and the code accordingly and respond to the individual comments as soon as possible.

Comment #69 Kevin Logan @ 2023-11-15 08:24

As topical editor of this submission, I want to thank the reviewers for the time and effort taken to compose the detailed and constructive feedback presented in the review comments. Following their recommendation, I request the authors to revise their manuscript so as to address the issues raised by the reviewers and resubmit the manuscript. Furthermore, the authors must submit a comment in response detailing how they addressed the points raised by the reviewers.

Invited Review Comment #68 Sangeetha Shankar @ 2023-11-14 15:23

Outcome: Revise

 

Detailed review:

Software descriptor:

The h5RDMtoolbox presented in this paper is a great initiative to enrich HDF5 files with metadata at all stages of the data lifecycle. The concept is generic and can theoretically be used in various research domains. In the introduction section of the software descriptor, the authors have explained the need for easier ways for FAIRification of data. However, more effort is required in terms of stating the novelty and originality of the work as well as state-of-the-art approaches for handling HDF5 datasets; particularly, information on similar existing tools is missing.

The paper is well-structured and easy to follow. Links in the paper pointing to the software as well as the example are functional. The use of the xarray package to read data from HDF5 together with its metadata is a wise choice. On the other hand, the example provided in section 3.3 does not clearly convey the benefits of using metadata/attributes. From figure 5, I assumed that the units of measurement are stored as attributes, which are fetched by h5RDMtoolbox when creating the plot (this assumption was confirmed to be true while going through the tutorial). The authors could improve the figure as well as its textual explanation. In this case, it would be helpful to show a snippet of the yaml file as well.

The article could summarize which aspects of FAIR the proposed solution intends to improve and how. This information is scattered throughout the paper, and it would be beneficial to bring it together, for example in the form of a table. The article could also contain a comparison of the functionalities of similar tools with h5RDMtoolbox. As h5RDMtoolbox extends the h5py Python library, the authors could provide a statement on whether there are any differences in performance when executing similar operations with h5RDMtoolbox and h5py, and whether the size of the HDF5 files affects the performance of their toolbox.

Some colors used in Figure 1 may be hard to distinguish for persons with difficulties recognizing red/green colors. To make the figure easily readable for everyone, the authors could consider switching to a color-vision-deficiency-friendly color palette which can be defined using online tools.

Furthermore, I strongly recommend the authors to recheck the references list. Some items are incomplete, contain duplicate information, are wrongly formatted or contain broken links. It is also suggested to add the DOI of publications (or ISBN for books), wherever available.

 

Tool and its documentation:

I agree with Nils Preuß that the authors have put a lot of effort into the development of the tool, which is evident from their clean and well-structured code as well as the documentation of the tool. The coding style is good and helps the reader easily understand the code. The code is sufficiently commented and tests are included. However, metadata on the software in a standardized format is missing.

The tutorial on getting started with the toolbox is functional and gives users most of the basic information required to use the toolbox. In contrast, it is not easy to understand how to use the toolbox for an existing HDF5 file created in a different context (I was attempting to use the toolbox to add metadata to an HDF5 file containing multi-sensor data from a railway environment). The authors mention in the conclusion that they intend to test their toolbox with data from different scientific disciplines. However, the documentation does not provide sufficient guidance to users on creating new conventions (yaml) and new validators. I strongly recommend that the authors create a dedicated section in the documentation for this topic. Alongside that, the documentation could list the validators that are already defined in the toolbox and explain the validation they perform. Furthermore, the use of vocabularies plays a great role in making datasets interoperable. It would be worthwhile if the authors made a comment on the possible use of vocabularies in their solution and its need/role in the sustainable development and reuse of their toolbox.

I also came across many spelling and grammatical mistakes in the documentation. The flow of contents in the documentation could be improved. Furthermore, the link in the readme file pointing to the documentation and examples is not working.

System requirements are missing in the documentation of the tool. If there are no specific system requirements, this can be explicitly mentioned. Also, it is unclear whether the tool was tested in different operating systems. The authors can make a statement on the working of the tool in different operating systems.

 

Comment on authorship:

The list of authors of the manuscript is not consistent with the list of developers of the software. In GitHub, there are contributions from a person with username ‘lucasbuettner’. This person is neither listed as an author of the paper nor mentioned in the acknowledgements section. The authors could add to GitHub a list of current and past contributors to the project. A conflict of interest statement is also missing in the paper.

Invited Review Comment #67 Anonymous @ 2023-11-12 15:58

The authors present a Python package that provides a high-level interface for working with HDF5 files and validating metadata according to custom metadata conventions. The interface to HDF5 files is a thin wrapper around the h5py package and exposes a similar API, with one difference being that data arrays of the xarray package are returned when querying datasets. Beyond that, a query mechanism to search and filter datasets in individual hdf5 files, the local filesystem, or MongoDB database instances is provided. Regarding the above-mentioned metadata conventions, the authors introduced a file layout in yaml syntax that allows defining such conventions in a human- and machine-readable format. A convention can be loaded into the package, which will include metadata validation steps against the loaded convention in subsequent I/O operations from or to HDF5 files with the package.


I believe this software is a valuable contribution to the scientific community and could be useful for researchers in various disciplines. Moreover, it fits well into the scope of the Ing.Grid Journal. However, before I can fully recommend this manuscript for submission, I would be grateful if the authors could address or respond to the following comments:


1. Unfortunately, attempting to run the quick start example was not without issues. Installation via pip fails with python3.11; I guess this is because the setup.cfg of the repository requires python < 3.11. However, the README only states “tested until 3.10”. Is python<3.11 a hard requirement, or can it be loosened? If yes, I suggest doing so, as there is already a python3.12 stable release. If not, the readme should probably be rephrased.


2. There also seems to be an issue with python3.10 on MacOS - I wasn’t able to reproduce the error on a machine that runs Ubuntu22.04. On MacOS with python3.10, the quick start example raised an exception when loading the convention. After downgrading to Python3.9 it worked. If this can be confirmed, maybe it would be good to state this in the README such that users are not surprised when the quick start example does not work. By the way, I got the same error in the provided Google Colab notebook.


3. I think both the manuscript as well as the online documentation could use some language editing. I stumbled across a number of typos in the online documentation, as well as a few in the paper (I will list some of them at the end of this comment).


4. Two more things regarding the online documentation: (1) the link in the README in the first sentence of the “Documentation” section yields a 404, and (2) on the following page on the database features, there is a large code cell containing the traceback of a keyboard interrupt exception (https://h5rdmtoolbox.readthedocs.io/en/latest/database/h5mongo.html#first-things-first-connection-to-the-db). Is this intentional?


5. In several places in the online documentation, it seemed to me that there are escape characters that shouldn’t be rendered. For instance, here: https://h5rdmtoolbox.readthedocs.io/en/latest/database/Serverless.html#advances-dataset-searches, above the first code box. Shouldn’t this be simply “$eq” and “$gt”?
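For reference, the operators as they presumably should appear, shown in a generic MongoDB-style filter (not necessarily the toolbox’s exact query syntax):

```
# "$eq" and "$gt" are standard MongoDB comparison operators:
filter_eq = {"units": {"$eq": "m/s"}}  # attribute equals "m/s"
filter_gt = {"mean": {"$gt": 0.5}}     # attribute greater than 0.5
```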


6. The README of the repository states that the toolbox supports “(4) sharing data,.. e.g. to repositories like Zenodo”. I was scanning the API documentation but I couldn’t find anything related to this. Does the package provide an interface to automatically create a Zenodo dataset from a local file system or MongoDB instance? If not, I suggest that this part of the README is rephrased. The same holds for “(5) reusing data”. I could only see how conventions are “reused” from Zenodo, but not entire datasets. But besides this, I am not sure if such a thing should actually be part of the responsibilities of this package.


7. I suggest adding some more context to the part of the manuscript where the conventions are first introduced (section 3.1). The way I understand it, when a convention is loaded with `h5tbx.use(cv)`, the metadata of subsequent I/O operations with the toolbox is validated against this convention. I think it would help to state something like this before jumping down to details like “target_method”. The way I understand it is that when choosing “target_method: create_dataset” on an attribute, one can (for instance) make an attribute mandatory for every dataset that is created? As said, I think some more explanation of how the loaded convention affects the API and how it relates to the terms in the yaml file could be useful here.
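To make this reading concrete, a hypothetical standard-attribute entry of the convention yaml file; the key names echo the terms used in the manuscript (e.g. target_method), the others are assumptions for illustration:

```
units:
  target_method: create_dataset  # validated whenever a dataset is created
  validator: $units              # value must be a parseable physical unit
  requirement: required          # makes the attribute mandatory
  description: Physical unit of the dataset.
```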


8. I am not so sure to what extent the package addresses the aspect of “sharing” as stated in the manuscript, in particular in Section 3.4 “Sharing and re-using”. I think the reusing aspect is fine, as the toolbox provides query and filter mechanisms that I can use to explore the dataset of someone else. But I think that “sharing”, in particular in connection to the FAIR principles that are referenced in the manuscript, would rather refer to making the data publicly accessible somewhere behind a persistent identifier, etc. Maybe it is just a wording issue and I am misinterpreting this.


9. It would be nice if there was a concise summary of what it is that the package provides. In the “conclusion and outlook”, there are the three bullet points, but maybe it could help if such an overview was placed at the beginning of Section 3? This way, maybe the reader is more prepared for what is to come in 3.1-3.4.


10. Just a minor comment/question regarding the implementation of the support for MongoDB instances. I was a bit surprised to see code like `h5[“dataset”].mongo.insert(…)`. So the mongo accessor is part of the dataset interface? Wouldn’t it make sense to have a generic database interface that could be implemented for different backends (i.e. MongoDB, the local file system, or also other database system), and have that as a separate object and not part of the dataset interface? That way the dataset itself would be decoupled from the database, and one could provide backends for different database implementations without having to touch the dataset implementation (there could be a protocol on the API that a database implementation must fulfil to enable type checking).
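A sketch of the decoupled design suggested here, using a typing.Protocol so that database backends can be type-checked without being part of the dataset interface (all names are illustrative):

```
from typing import Protocol

class DBBackend(Protocol):
    """What any database backend (MongoDB, filesystem, ...) must provide."""
    def insert(self, metadata: dict) -> None: ...
    def find(self, flt: dict) -> list: ...

def index_dataset(dataset, db: DBBackend) -> None:
    """Insert a dataset's metadata into an arbitrary backend; the dataset
    implementation stays free of database-specific code."""
    db.insert({"name": dataset.name, "attrs": dict(dataset.attrs)})
```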


A few of the language issues I stumbled across in the manuscript:


l. 97: “…, the toolbox requires a high-level interface…” -> the toolbox provides?

l.119: “It will fit most engineering data and proofs to be suitable to harmonize heterogeneous source data.” -> … proofs to be suitable to harmonise data from heterogeneous sources?

l.291: “…, the performance of finding data is not compatible with a dedicated system”. I am not sure what the authors are trying to say here. Maybe “… the performance of finding data is not comparable to that of a database system”?

l.321: ... “extensive documentation automatically created and published…” -> IS automatically created…?

Invited Review Comment #64 Nils Preuß @ 2023-11-06 13:25

Overall assessment: revise

The authors present a Python tool called h5RDMtoolbox that helps with the implementation and maintenance of FAIR research data management along the data lifecycle, using HDF5 as the core file format.
The authors clearly made an effort to ensure code quality, as well as to provide extensive documentation, usage examples and tutorials. However, the manuscript / software descriptor would profit from a more careful presentation, especially a discussion of how the software compares to other similar and related software in terms of scope and provided functionality, as well as the software architecture and usage. While the document is well formatted and stylized, the writing / language in general could be improved.
More detailed comments are provided below.



Detailed comments (c.f. https://www.inggrid.org/site/reviewerguidelines/):

Statement of need:
    The introduction focuses a lot on the challenges and importance of metadata and research data management in a very general sense. Section 2 elaborates on typical workflow steps using the concept of the research data life cycle, however also in a very general sense, without references to literature, and more importantly, without reference to typical workflows and challenges the target audience may encounter.

Discussion of how the software compares to other similar and related alternative software in terms of scope and provided functionality:
    The article would benefit from keeping the contents of the current introduction a lot more concise, and from introducing a typical workflow, including its current challenges and current solutions (without the presented tool), along the data lifecycle steps in section 2. This allows readers to understand the purpose and scope/limits of the tool and to recognize the needs that the presented tool addresses much more clearly, as well as to compare with the workflow that does use the tool, presented in later sections.

Novelty of the approach:
    Clarification regarding the needs and/or current typical workflows, including typical solutions or workarounds (as suggested above), would emphasise the novelty of the presented approach.
    The article would benefit from a paragraph in section 3 (methodology) contrasting the presented solution for metadata conventions with competing popular approaches based on semantic web technologies or schemas. At least some assessment of how the approach addresses selected aspects of the FAIR principles would be much appreciated.

Software architecture and usage:
    The article is very unclear as to how a user of the tool interfaces with the introduced metadata conventions, or how the metadata validators are defined or implemented, as well as what their scope and limitations are.
    Which workflow steps / user interface features are metadata-aware, and in which way? My best guess is that the example dataset references HDF5 dimension scales (an HDF5 feature that allows automatic translation of dataset indices to numeric information, e.g. spatial or temporal, about the "grid" of the dataset), which are then described by standard metadata to provide information about physical quantities and units for plotting (or maybe consistency checks)?
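For reference, the dimension-scale mechanism guessed at here works as follows in plain h5py (independent of the presented toolbox):

```
import h5py
import numpy as np

with h5py.File("example.hdf", "w") as f:
    f["time"] = np.linspace(0.0, 1.0, 100)
    f["time"].attrs["units"] = "s"   # standard metadata on the scale
    f["time"].make_scale("time")     # declare the dataset a dimension scale

    f["velocity"] = np.random.rand(100)
    f["velocity"].attrs["units"] = "m/s"
    f["velocity"].dims[0].attach_scale(f["time"])  # map axis 0 indices to time
```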


Since the available documentation is very comprehensive and detailed already, the article could leverage this by addressing those open questions concisely where appropriate and pointing to more detailed explanations in the form of such supplementary material via a clear reference / link if needed.



Additional notes (c.f. https://www.inggrid.org/site/reviewerguidelines/)

Software Review Requirements:
    The software code is deposited to a repository. The code is open to be accessed, viewed, browsed and cloned anonymously.
    The repository contains a README file explaining the installation procedure and dependencies, as well as purpose, scope and usage. The installation procedure is automated via pip.
    Instructions on how to contribute to the project are included in the repository. They reference instructions regarding docstrings, docs and tests.
    The software submission does not depend on proprietary software.
    Installation instructions clearly and comprehensively detail all required dependencies, including required version numbers. Python specific standard automated dependency management tooling is used.
    The software code complies with formal standards regarding commenting, formatting conventions, coding standards and structure. It is easily readable and makes use of community or language-accepted common code structures.
    The repository uses version control.
    The repository allows issue tracking and submission of issues.
    The software adopts automated tests.

    A working example of the software is included via the documentation of the software repository. Required minimal example data (convention file) is provided via external data repositories (Zenodo); manual download and adaptation of the software example is necessary due to changes in Zenodo's API (zsearch.search_doi() behaviour is affected by this).
    The software is NOT described by metadata using a formal, accessible, shared, and broadly applicable language for knowledge representation (cf. https://codemeta.github.io/). Some metadata is, however, provided via the setup.cfg file.
    The software is licensed under an Open Source license (CC-BY). The repository contains a plain text licensing file. It includes copyright and licensing statements of third-party software (gitbutler?) that are legally bundled with the code in compliance with the conditions of those licenses. It does NOT however refer to the CC-BY license used for the submission. The github page refers to a MIT license.

Software Descriptor Requirements:
    Version number of the code: Missing
    State of the art and novelty of the approach: See detailed comments.
    Software license: Open licenses are referenced, see above.
    Statement of need: Addressed on a very general level in Introduction, addressed somewhat in section 2.
    Purpose and scope/limits of the application: See detailed comments.
    Utility of the software: See detailed comments.
    User notes: See detailed comments.
    Sample data availability and reference to related datasets: Not that clear, embedded in quickstart notebook.
    Conflicts of interest: No Statement available.

Format:
    Code listings do not have captions; their line numbering conflicts with the line numbering of the overall document (maybe this is a preprint / template issue?).


Metadata
  • Published: 2023-10-11
  • Last Updated: 2024-07-22
  • License: Creative Commons Attribution 4.0
  • Subjects: Data Management Software
  • Keywords: Data management, metadata, HDF5, data lifecycle, Python, database