h5RDMtoolbox - A Python Toolbox for FAIR Data Management around HDF5

This is a Preprint and has not been peer reviewed. This is version 3 of this Preprint.

Authors

Matthias Probst, Balazs Pritz

Abstract

Sustainable data management is fundamental to efficient and successful scientific research. The FAIR principles (Findable, Accessible, Interoperable and Reusable) have proven to be successful guidelines enabling comprehensible analysis, discovery and re-use. Although the topic has recently gained increasing awareness in both academia and industry, the engineering sciences in particular are lagging behind in managing the valuable asset of data. While large collaborations and research facilities have already implemented metadata strategies, smaller research groups and institutes often lack a common strategy due to heterogeneous and rapidly changing environments as well as missing capacity or expertise. This paper presents an open-source Python package, called h5RDMtoolbox, that helps to quickly implement and maintain FAIR research data management along the entire data lifecycle using HDF5 as the core file format. One of the key features of the toolbox is the flexible, high-level implementation of metadata standards, adaptable to the changing requirements of projects, collaborations and environments, such as experimental or computational setups. Implementing existing schemas such as EngMeta or the CF conventions is possible and an intended use case. Other benefits of the toolbox include a simplified interface to the data and database solutions for querying metadata stored in HDF5 files.
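For illustration, the intended workflow can be sketched in a few lines (a minimal sketch based on the quick-start documentation; the convention file name is a placeholder, and the loader helper and the attrs keyword are assumptions that may differ between versions):

```python
import h5rdmtoolbox as h5tbx

# load and enable a metadata convention (hypothetical loader and file name)
cv = h5tbx.convention.from_yaml('my_convention.yaml')
h5tbx.use(cv)

# subsequent I/O through the wrapper is validated against the convention
with h5tbx.File('example.h5', 'w') as h5:
    h5.create_dataset('velocity', data=[1.2, 1.3], attrs={'units': 'm/s'})

# datasets are returned as xarray objects carrying their attributes
with h5tbx.File('example.h5') as h5:
    print(h5['velocity'][()])
```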

Comments

Comment #82 Kevin Logan @ 2023-12-22 11:41

Dear Authors,
Thank you for your efforts towards improving your software package and the software descriptor, and your extensive response to the issues raised by the reviewers. I have invited all three reviewers to reassess your submitted software package and software descriptor in view of the changes and your response. A new decision on your manuscript will be made based on the new feedback of the reviewers.

Comment #80 Matthias Probst @ 2023-12-20 16:42

I forgot to comment on data availability: there is no single data source for this software. We ship most of the (example/test) data with the GitHub repository, and some examples use data downloaded from Zenodo. My best guess is that I should provide the GitHub repository link.

Comment #79 Matthias Probst @ 2023-12-20 16:39

The new version of the manuscript uses a different structure to enhance the readability and understanding of the scope and capabilities of the toolbox:
1. A section "Scope and related work" was added. It outlines existing solutions for managing HDF5 files and states the need for a holistic approach covering all aspects of working with HDF5 across the research data lifecycle.
2. Section 3 is now the main part of the manuscript. It explains the concept and architecture of the package, organized not along the lifecycle but along the sub-packages of the h5rdmtoolbox. This is cleaner and results from some refactoring of the toolbox prompted by the helpful reviewer comments on the code. Moreover, the new structure avoids jumping back and forth, as was the case in version 1 of the manuscript, where the lifecycle was the basis for explaining the toolbox. The data lifecycle remains important, however, and the paragraphs on the sub-packages state their relevance in the cycle and their contribution to the FAIRness of HDF5 files.
3. The reference list was updated as requested.
4. Authorship was clarified by acknowledging the contributor registered in the GitHub repository and adding him to the codemeta file.
5. Some of the explanations were moved to the documentation, as they might become too confusing in the written text and require hands-on experience (possible with the provided Jupyter notebooks). Minimal examples are still given as code listings, usually in combination with a figure.
6. Code listings now have captions. The issue with the overlapping numbers appears to be a template issue.
7. A section on limitations, including a remark on possible performance issues, was added.
8. The figures were updated. The color issue was solved by using a white font in Figure 1 (unfortunately, after uploading the new version I noticed that only one text element took the new color; I will fix this together with the next version).
9. I am not sure about the remark by Nils Preuß concerning the license. Is there a conflict between the software license (MIT) and the publication? If so, I would be glad to get advice on how to resolve it.
If we missed something, we are happy to update the manuscript and the documentation (the latter is, of course, constantly evolving; at the current state, we are focusing on improving structure and readability rather than adding new features).

Comment #76 Matthias Probst @ 2023-12-18 20:12

Dear reviewers,

I would like to comment on the issues raised about the implementations first, before commenting on the issues raised about the manuscript.

As of today, a new version (v1.0.0) has been released and is available via pip. The documentation has also been updated according to the changes listed below. We tried to be more concise and added more information about architectural details. Broken links were fixed. Language editing of the documentation is work in progress.


The major changes as a consequence of the helpful and constructive feedback are:

• The architecture of the toolbox was updated: it now has five dedicated modules (sub-packages). Only the wrapper (around HDF5) uses the convention module (cf. https://h5rdmtoolbox.readthedocs.io/en/latest/_images/h5tbx_modules.svg)

• The toolbox is now more maintainable and allows adding new database and repository interfaces beyond the ones implemented. This is achieved by a clear object-oriented approach, using abstract classes that define the "rules" of the interfaces (a sketch of this pattern follows this list).

• The issue with the Zenodo API is resolved; the relevant code is now part of the toolbox rather than of an external package.

• The repository module explicitly supports uploading to and downloading from repositories, promoting the aspect of sharing.

• A solution to assign persistent identifiers (IRIs) to HDF5 attributes was added. This achieves better interoperability.

• The formal description of the package is improved by adding the file codemeta.json.

• Extensive testing for all current Python versions starting at 3.8 and for all major operating systems is included in the test pipeline.

• The documentation is updated according to the changes.
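To make the abstract-class pattern mentioned above more tangible, here is a minimal sketch (class and method names are illustrative, not the toolbox's actual API):

```python
from abc import ABC, abstractmethod

class RepositoryInterface(ABC):
    """Defines the "rules" every repository realization must follow."""

    @abstractmethod
    def get_doi(self) -> str:
        """Return the persistent identifier (DOI) of the record."""

    @abstractmethod
    def upload_file(self, filename: str) -> None:
        """Upload a local file to the repository record."""

    @abstractmethod
    def download_file(self, filename: str) -> str:
        """Download a file from the record and return its local path."""

class ZenodoInterface(RepositoryInterface):
    """One concrete realization; further backends can be added without
    touching the rest of the code."""
    def get_doi(self) -> str: ...
    def upload_file(self, filename: str) -> None: ...
    def download_file(self, filename: str) -> str: ...
```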


Detailed comments with references to the review comments:

• A codemeta.json file was added, so the software is now formally described (@#64: a version number was included in the setup.cfg file, as is common practice for Python packages, and is now also available in the codemeta file. Should I be missing another place to put a version, I am happy to fix that.)

• The package is now tested against Python 3.8 through 3.12 on macOS, Windows and Linux (see https://github.com/matthiasprobst/h5RDMtoolbox/actions)
This should resolve the issues raised in comments #67 and #68 concerning system requirements and testing

• Issues with the documentation and broken links, as mentioned in #67 (points 4 and 5), are fixed

• Concerning the Zenodo API in comments #67 and #64: Zenodo.org changed the API. The full dependency on the package zenodo_search was resolved by making the code part of the software, and the communication with the Zenodo API was updated. This led to two outcomes: 1. Files can be downloaded from Zenodo again and no longer need to be downloaded manually. 2. The module/sub-package "repository" has been added, which provides an abstract interface class. One such realization is the Zenodo interface. Hence, additional interfaces can be implemented by others without touching the rest of the code. Reference to the documentation: https://h5rdmtoolbox.readthedocs.io/en/latest/repository/zenodo.html
The interface to Zenodo is explicitly part of the toolbox. It allows users to upload and download HDF5 files (see https://h5rdmtoolbox.readthedocs.io/en/latest/repository/zenodo.html).
I hope that this also answers point 6 in comment #67.
I will address the repository part and the modular design in the manuscript, too.

• Referring to point 8 in comment #67: I agree that the aspect of sharing was not backed up strongly enough. Through the implementation of the "repository" module described above, it hopefully becomes clearer. The explicit implementation of repository solutions by means of an abstract class, which enforces the provision of a DOI, should fulfill the aspect of "sharing".

• I added the option to assign IRIs to HDF5 attributes (https://h5rdmtoolbox.readthedocs.io/en/latest/convention/ontologies.html). This was not possible before and, to my knowledge, no other HDF5-related software supports it. With this, I try to further strengthen the FAIRness of HDF5 files, especially the aspect of interoperability (a minimal sketch is given after this list). I hope this adequately answers the comment in #68 about the use of vocabularies. I will add a part in the manuscript, too.

• Thanks for the helpful comment in #67 concerning the implementation of the databases: just like the repository module, the database module was refactored so that it is now very modular: new database interfaces can easily be added. Again, an abstract class sets the "rules" for future implementations and successful work with HDF5 files. Link to the documentation: https://h5rdmtoolbox.readthedocs.io/en/latest/database/index.html

• A comment on performance, as raised in comment #68: The primary goal of the toolbox is to achieve richly described HDF5 files and to provide tools to interface with databases and repositories. This adds overhead and certainly does not make working with HDF5 faster – but it makes it FAIRer. The toolbox does not limit the underlying package h5py and therefore the capabilities of working with HDF5 through Python. Certainly, I/O gets a bit slower due to the interface with xarray and the additional metadata validation. I was not planning on conducting speed tests. However, I will mention the scope but also the limitations in this respect in the new version of the manuscript. Would something like a speed test be desired? (A sketch of how such a comparison could look is given directly below.)
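Should a speed test be desired, a simple I/O comparison could look like the following (purely illustrative, not part of the toolbox; it assumes h5rdmtoolbox.File mirrors h5py.File as described above):

```python
import timeit

setup = "import numpy as np; import {mod}; data = np.random.rand(1000, 1000)"
stmt = (
    "with {mod}.File('test_{mod}.h5', 'w') as f:\n"
    "    f.create_dataset('data', data=data)"
)

# compare plain h5py against the wrapped interface
for mod in ('h5py', 'h5rdmtoolbox'):
    t = timeit.timeit(stmt.format(mod=mod), setup=setup.format(mod=mod), number=10)
    print(f'{mod}: {t:.3f} s')
```

And, picking up the IRI feature mentioned above, a minimal sketch (the accessor name and the example IRI are assumptions; see the linked documentation for the actual syntax):

```python
import h5rdmtoolbox as h5tbx

with h5tbx.File('example.h5', 'w') as h5:
    ds = h5.create_dataset('u', data=1.3)
    ds.attrs['units'] = 'm/s'
    # link the attribute to a persistent identifier (hypothetical accessor),
    # e.g. the QUDT IRI for the unit m/s:
    ds.iri['units'] = 'http://qudt.org/vocab/unit/M-PER-SEC'
```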


The next comment will be on the new version of the manuscript and on the remaining remarks in the comments.

Comment #70 Matthias Probst @ 2023-11-22 07:59

To all reviewers, thank you very much for the constructive and helpful feedback. We will revise the manuscript and the code accordingly and respond to the individual comments as soon as possible.

Comment #69 Kevin Logan @ 2023-11-15 08:24

As topical editor of this submission, I want to thank the reviewers for the time and effort taken to compose the detailed and constructive feedback presented in the review comments. Following their recommendation, I request the authors to revise their manuscript so as to address the issues raised by the reviewers and resubmit the manuscript. Furthermore, the authors must submit a comment in response detailing how they addressed the points raised by the reviewers.

Invited Review Comment #68 Sangeetha Shankar @ 2023-11-14 15:23

Outcome: Revise

 

Detailed review:

Software descriptor:

The h5RDMtoolbox presented in this paper is a great initiative to enrich HDF5 files with metadata at all stages of the data lifecycle. The concept is generic and can theoretically be used in various research domains. In the introduction section of the software descriptor, the authors have explained the need for easier ways of FAIRifying data. However, more effort is required in stating the novelty and originality of the work as well as the state-of-the-art approaches to handling HDF5 datasets; in particular, information on similar existing tools is missing.

The paper is well-structured and easy to follow. Links in the paper pointing to the software as well as the example are functional. The use of the xarray package to read data from HDF5 together with its metadata is a wise choice. On the other hand, the example provided in section 3.3 does not clearly convey the benefits of using metadata/attributes. From figure 5, I assumed that the units of measurement are stored as attributes, which are fetched by h5RDMtoolbox when creating the plot (this assumption was confirmed to be true while going through the tutorial). The authors could improve the figure as well as its textual explanation. In this case, it would be helpful to show a snippet of the YAML file as well.
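For context, this attribute-driven labelling is standard xarray behaviour that the toolbox builds on; a minimal example independent of the toolbox:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.sin(np.linspace(0.0, 2.0 * np.pi, 50)),
    dims='time',
    coords={'time': ('time', np.linspace(0.0, 1.0, 50), {'units': 's'})},
    attrs={'units': 'm/s', 'long_name': 'velocity'},
)
da.plot()  # axis labels are assembled from the 'long_name' and 'units' attributes
```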

The article could summarize which aspects of FAIR the proposed solution intends to improve and how. This information is scattered throughout the paper and it would be beneficial to bring it together, for example in the form of a table. The article could also contain a comparison of the functionalities of similar tools with h5RDMtoolbox. As h5RDMtoolbox extends the h5py Python library, the authors could provide a statement on whether there are any differences in performance when executing similar operations with h5RDMtoolbox and h5py, and whether the size of the HDF5 files affects the performance of their toolbox.

Some colors used in Figure 1 may be hard to distinguish for persons with difficulties recognizing red/green colors. To make the figure easily readable for everyone, the authors could consider switching to a color-vision-deficiency-friendly color palette which can be defined using online tools.

Furthermore, I strongly recommend the authors recheck the reference list. Some items are incomplete, contain duplicate information, are wrongly formatted or contain broken links. It is also suggested to add the DOIs of publications (or ISBNs for books) wherever available.

 

Tool and its documentation:

I agree with Nils Preuß that the authors have put a lot of effort into the development of the tool, which is evident from their clean and well-structured code as well as the documentation of the tool. The coding style is good and helps the reader easily understand the code. The code is sufficiently commented and tests are included. However, metadata on the software in a standardized format is missing.

The tutorial on getting started with the toolbox is functional and gives users most of the basic information required to use the toolbox. By contrast, it is not easy to understand how to use the toolbox for an existing HDF5 file created in a different context (I was attempting to use the toolbox to add metadata to an HDF5 file containing multi-sensor data from a railway environment). The authors mention in the conclusion that they intend to test their toolbox with data from different scientific disciplines. However, the documentation does not provide sufficient guidance to users on creating new conventions (YAML) and new validators. I strongly recommend the authors create a dedicated section in the documentation for this topic. Alongside that, the documentation could list the validators that are already defined in the toolbox and explain the validation they perform. Furthermore, the use of vocabularies plays a great role in making datasets interoperable. It would be worthwhile if the authors commented on the possible use of vocabularies in their solution and its need/role in the sustainable development and reuse of their toolbox.

I also came across many spelling and grammatical mistakes in the documentation. The flow of contents in the documentation could be improved. Furthermore, the link in the readme file pointing to the documentation and examples is not working.

System requirements are missing in the documentation of the tool. If there are no specific system requirements, this can be explicitly mentioned. Also, it is unclear whether the tool was tested on different operating systems. The authors could make a statement on whether the tool works on different operating systems.

 

Comment on authorship:

The list of authors of the manuscript is not consistent with the list of developers of the software. In GitHub, there are contributions from a person with the username 'lucasbuettner'. This person is neither listed as an author of the paper nor mentioned in the acknowledgements section. The authors could add to GitHub a list of current and past contributors to the project. A conflict-of-interest statement for the authors is also missing in the paper.

Invited Review Comment #67 Anonymous @ 2023-11-12 15:58

The authors present a Python package that provides a high-level interface for working with HDF5 files and validating metadata according to custom metadata conventions. The interface to HDF5 files is a thin wrapper around the h5py package and exposes a similar API, one difference being that data arrays of the xarray package are returned when querying datasets. Beyond that, a query mechanism to search and filter datasets in individual HDF5 files, the local file system, or MongoDB database instances is provided. Regarding the above-mentioned metadata conventions, the authors introduced a file layout in YAML syntax that allows defining such conventions in a human- and machine-readable format. A convention can be loaded into the package, which will include metadata validation steps against the loaded convention in subsequent I/O operations from or to HDF5 files with the package.
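A sketch of what such a query might look like in use (the entry point and filter syntax here are assumptions modelled on the MongoDB-style operators mentioned in the documentation, not verified API):

```python
import h5rdmtoolbox as h5tbx

# hypothetical query: find all datasets in a file whose attribute
# 'standard_name' equals 'x_velocity'; operators such as {'$gt': ...}
# can express range filters on numeric attributes
results = h5tbx.database.find('example.h5', {'standard_name': 'x_velocity'})
for res in results:
    print(res)
```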


I believe this software is a valuable contribution to the scientific community and could be useful for researchers in various disciplines. Moreover, it fits well into the scope of the ing.grid journal. However, before I can fully recommend this manuscript for publication, I would be grateful if the authors could address or respond to the following comments:


1. Unfortunately, attempting to run the quick start example was not without issues. Installation via pip fails with Python 3.11; I guess this is because the setup.cfg of the repository requires python < 3.11. However, the README only states "tested until 3.10". Is python < 3.11 a hard requirement or can it be loosened? If it can, I suggest doing so, as there is already a stable Python 3.12 release. If not, the README should probably be rephrased.


2. There also seems to be an issue with Python 3.10 on macOS - I wasn't able to reproduce the error on a machine running Ubuntu 22.04. On macOS with Python 3.10, the quick start example raised an exception when loading the convention. After downgrading to Python 3.9 it worked. If this can be confirmed, maybe it would be good to state it in the README so that users are not surprised when the quick start example does not work. By the way, I got the same error in the provided Google Colab notebook.


3. I think both the manuscript and the online documentation could use some language editing. I stumbled across a number of typos in the online documentation, as well as a few in the paper (I will list some of them at the end of this comment).


4. Two more things regarding the online documentation: (1) the link in the README in the first sentence of the “Documentation” section yields a 404, and (2) on the following page on the database features, there is a large code cell containing the traceback of a keyboard interrupt exception (https://h5rdmtoolbox.readthedocs.io/en/latest/database/h5mongo.html#first-things-first-connection-to-the-db). Is this intentional?


5. In several places in the online documentation, it seemed to me that there are escape characters that shouldn’t be rendered. For instance, here: https://h5rdmtoolbox.readthedocs.io/en/latest/database/Serverless.html#advances-dataset-searches, above the first code box. Shouldn’t this be simply “$eq” and “$gt”?


6. The README of the repository states that the toolbox supports "(4) sharing data,.. e.g. to repositories like Zenodo". I was scanning the API documentation but couldn't find anything related to this. Does the package provide an interface to automatically create a Zenodo dataset from a local file system or MongoDB instance? If not, I suggest that this part of the README be rephrased. The same holds for "(5) reusing data". I could only see how conventions are "reused" from Zenodo, but not entire datasets. But besides this, I am not sure whether such a thing should actually be part of the responsibilities of this package.


7. I suggest adding some more context to the part of the manuscript where the conventions are first introduced (section 3.1). The way I understand it, when a convention is loaded with `h5tbx.use(cv)`, the metadata of subsequent I/O operations with the toolbox is validated against this convention. I think it would help to state something like this before jumping down to details like "target_method". The way I understand it, choosing "target_method: create_dataset" on an attribute lets one (for instance) make that attribute mandatory for every dataset that is created (a hypothetical YAML excerpt is sketched after this list). As said, I think some more explanation of how the loaded convention affects the API and how it relates to the terms in the YAML file could be useful here.


8. I am not so sure to what extent the package addresses the aspect of "sharing" as stated in the manuscript, in particular in Section 3.4 "Sharing and re-using". I think the reusing aspect is fine, as the toolbox provides query and filter mechanisms that I can use to explore someone else's dataset. But I think that "sharing", in particular in connection with the FAIR principles referenced in the manuscript, would rather refer to making the data publicly accessible somewhere behind a persistent identifier, etc. Maybe it is just a wording issue and I am misinterpreting this.


9. It would be nice if there were a concise summary of what the package provides. In the "conclusion and outlook", there are three bullet points, but maybe it would help if such an overview were placed at the beginning of Section 3? This way the reader would be better prepared for what is to come in 3.1-3.4.


10. Just a minor comment/question regarding the implementation of the support for MongoDB instances. I was a bit surprised to see code like `h5["dataset"].mongo.insert(…)`. So the mongo accessor is part of the dataset interface? Wouldn't it make sense to have a generic database interface that could be implemented for different backends (i.e. MongoDB, the local file system, or other database systems), and have that as a separate object rather than part of the dataset interface? That way the dataset itself would be decoupled from the database, and one could provide backends for different database implementations without having to touch the dataset implementation (there could be a protocol on the API that a database implementation must fulfil to enable type checking).
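Regarding point 7, a hypothetical excerpt of a convention entry may make the relationship between the YAML terms and the API more tangible (the keys follow the manuscript's description; the actual schema and validator names may differ):

```python
# a single convention term, embedded as a string for illustration
convention_excerpt = """
units:
  target_method: create_dataset  # validated whenever a dataset is created
  validator: $pintunit           # value must parse as a physical unit (assumed name)
  requirement: required          # omitting 'units' on creation raises an error
"""
# after h5tbx.use(cv), every create_dataset() call is checked against such terms
```

Regarding point 10, the suggested decoupling could look roughly like this (illustrative names only):

```python
from typing import Any, Protocol

class HDF5DBInterface(Protocol):
    """The "rules" any database backend must fulfil (enables type checking)."""
    def insert(self, h5obj: Any) -> None: ...
    def find(self, flt: dict) -> list: ...

class MongoDBBackend:
    """One backend implementation; the dataset stays database-agnostic."""
    def __init__(self, collection):
        self.collection = collection

    def insert(self, h5obj: Any) -> None:
        self.collection.insert_one(dict(h5obj.attrs))

    def find(self, flt: dict) -> list:
        return list(self.collection.find(flt))

# usage: backend.insert(h5['dataset']) instead of h5['dataset'].mongo.insert(...)
```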


A few of the language issues I stumbled across in the manuscript:


l. 97: “…, the toolbox requires a high-level interface…” -> the toolbox provides?

l.119: “It will fit most engineering data and proofs to be suitable to harmonize heterogeneous source data.” -> … proofs to be suitable to harmonise data from heterogeneous sources?

l.291: “…, the performance of finding data is not compatible with a dedicated system”. I am not sure what the authors are trying to say here. Maybe “… the performance of finding data is not comparable to that of a database system”?

l.321: ... “extensive documentation automatically created and published…” -> IS automatically created…?

Invited Review Comment #64 Nils Preuß @ 2023-11-06 13:25

Overall assessment: revise

The authors present a Python tool called h5RDMtoolbox that helps with the implementation and maintenance of FAIR research data management along the data lifecycle, using HDF5 as the core file format.
The authors clearly made an effort to ensure code quality, as well as to provide extensive documentation, usage examples and tutorials. However, the manuscript / software descriptor would profit from a more careful presentation, especially a discussion of how the software compares to other similar and related software in terms of scope and provided functionality, as well as of the software architecture and usage. While the document is well formatted and stylized, the writing / language in general could be improved.
More detailed comments are provided below.



Detailed comments (cf. https://www.inggrid.org/site/reviewerguidelines/):

Statement of need:
    The introduction focuses a lot on the challenges and importance of metadata and research data management in a very general sense. Section 2 elaborates on typical workflow steps using the concept of the research data life cycle, however also in a very general sense, without references to literature and, more importantly, without reference to typical workflows and challenges the target audience may encounter.

Discussion of how the software compares to other similar and related alternative software in terms of scope and provided functionality:
    The article would benefit from keeping the contents of the current introduction much more concise and from introducing a typical workflow, including its current challenges and current solutions (without the presented tool), along the data lifecycle steps in section 2. This would allow readers to understand the purpose and scope/limits of the tool, to recognize the needs that the presented tool addresses much more clearly, and to compare it with the workflow that uses the tool, presented in later sections.

Novelty of the approach:
    Clarification regarding the needs and/or current typical workflows, including typical solutions or workarounds (as suggested above), would emphasise the novelty of the presented approach.
    The article would benefit from a paragraph in section 3 (methodology) contrasting the presented solution for metadata conventions with competing popular approaches based on semantic web technologies or schemas. At least some assessment of how the approach addresses selected aspects of the FAIR principles would be much appreciated.

Software architecture and usage:
    The article is very unclear as to how a user of the tool interfaces with the introduced metadata conventions, or how the metadata validators are defined or implemented, as well as what their scope and limitations are.
    Which workflow steps / user interface features are metadata-aware, and in which way? My best guess is that the example dataset references HDF5 dimension scales (an HDF5 feature that allows automatic translation of dataset indices into numeric information, e.g. spatial or temporal, about the "grid" of the dataset), which are then described by standard metadata to provide information about physical quantities and units for plotting (or maybe consistency checks)?
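For reference, the dimension-scale feature guessed at here looks like this in plain h5py (no toolbox involved):

```python
import h5py
import numpy as np

with h5py.File('scales.h5', 'w') as f:
    t = f.create_dataset('time', data=np.linspace(0.0, 1.0, 10))
    t.attrs['units'] = 's'
    t.make_scale('time')        # declare 'time' as a dimension scale
    u = f.create_dataset('velocity', data=np.random.rand(10))
    u.attrs['units'] = 'm/s'
    u.dims[0].attach_scale(t)   # axis 0 of 'velocity' now maps to 'time'
```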


Since the available documentation is very comprehensive and detailed already, the article could leverage this by addressing those open questions concisely where appropriate and pointing to more detailed explanations in the form of such supplementary material via a clear reference / link if needed.



Additional notes (c.f. https://www.inggrid.org/site/reviewerguidelines/)

Software Review Requirements:
    The software code is deposited to a repository. The code is open to be accessed, viewed, browsed and cloned anonymously.
    The repository contains a README file explaining the installation procedure and dependencies as well as purpose, scope and usage. The installation procedure is automated via pip.
    Instructions on how to contribute to the project are included in the repository. They reference instructions regarding docstrings, docs and tests.
    The software submission does not depend on proprietary software.
    Installation instructions clearly and comprehensively detail all required dependencies, including required version numbers. Python-specific standard automated dependency-management tooling is used.
    The software code complies with formal standards regarding commenting, formatting conventions, coding standards and structure. It is easily readable and makes use of community or language-accepted common code structures.
    The repository uses version control.
    The repository allows issue tracking and submission of issues.
    The software adopts automated tests.

    A working example of the software is included via the documentation of the software repository. Required minimal example data (convention file) is provided via external data repositories (Zenodo); manual download and adaptation of the software example is necessary due to changes in Zenodo's API (zsearch.search_doi() behaviour is affected by this).
    The software is NOT described by metadata using a formal, accessible, shared, and broadly applicable language for knowledge representation (cf. https://codemeta.github.io/). Some metadata is, however, provided via the setup.cfg file.
    The software is licensed under an Open Source license (CC-BY). The repository contains a plain-text licensing file. It includes copyright and licensing statements of third-party software (gitbutler?) that is legally bundled with the code in compliance with the conditions of those licenses. It does NOT, however, refer to the CC-BY license used for the submission. The GitHub page refers to an MIT license.

Software Descriptor Requirements:
    Version number of the code: Missing
    State of the art and novelty of the approach: See detailed comments.
    Software license: Open licenses are referenced, see above.
    Statement of need: Addressed on a very general level in Introduction, addressed somewhat in section 2.
    Purpose and scope/limits of the application: See detailed comments.
    Utility of the software: See detailed comments.
    User notes: See detailed comments.
    Sample data availability and reference to related datasets: Not that clear, embedded in quickstart notebook.
    Conflicts of interest: No Statement available.

Format:
    Code listings do not have captions, and their line numbering conflicts with the line numbering of the overall document (maybe this is a preprint / template issue?).

Metadata
  • Published: 2023-10-11
  • Last Updated: 2024-01-02
  • License: Creative Commons Attribution 4.0
  • Subjects: Data Management Software
  • Keywords: Data management, metadata, HDF5, data lifecycle, Python, database