h5RDMtoolbox - A Python Toolbox for FAIR Data Management around HDF5

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.




Matthias Probst , Balazs Pritz


Sustainable data management is fundamental to efficient and successful scientific research. The FAIR principles (Findable, Accessible, Interoperable and Reusable) have proven to be successful guidelines for enabling comprehensible analysis, discovery and re-use. Although the topic has recently gained increasing awareness in both academia and industry, the engineering sciences in particular are lagging behind in managing the valuable asset of data. While large collaborations and research facilities have already implemented metadata strategies, smaller research groups and institutes often lack a common strategy due to heterogeneous and rapidly changing environments as well as missing capacity or expertise. This paper presents an open-source Python package, called h5RDMtoolbox, that helps to quickly implement and maintain FAIR research data management along the entire data lifecycle using HDF5 as the core file format. One of the key features of the toolbox is the flexible, high-level implementation of metadata standards, adaptable to the changing requirements of projects, collaborations and environments, such as experimental or computational setups. Implementations of existing schemas such as EngMeta or the cf-conventions are possible and intended use-cases. Other benefits of the toolbox include a simplified interface to the data and database solutions to query metadata stored in HDF5 files.


Data Management Software


Data management, metadata, HDF5, data lifecycle, Python, database


Published: 2023-10-11 01:40


Creative Commons Attribution 4.0



Comment #70 Matthias Probst @ 2023-11-22 00:59

To all reviewers, thank you very much for the constructive and helpful feedback. We will revise the manuscript and the code accordingly and respond to the individual comments as soon as possible.

Comment #69 Kevin Logan @ 2023-11-15 01:24

As topical editor of this submission, I want to thank the reviewers for the time and effort taken to compose the detailed and constructive feedback presented in the review comments. Following their recommendation, I request the authors to revise their manuscript so as to address the issues raised by the reviewers and resubmit the manuscript. Furthermore, the authors must submit a comment in response detailing how they addressed the points raised by the reviewers.

Invited Review Comment #68 Sangeetha Shankar @ 2023-11-14 08:23

Outcome: Revise


Detailed review:

Software descriptor:

The h5RDMtoolbox presented in this paper is a great initiative to enrich HDF5 files with metadata at all stages of the data lifecycle. The concept is generic and can theoretically be used in various research domains. In the introduction section of the software descriptor, the authors have explained the need for easier ways for FAIRification of data. However, more efforts are required in terms of stating the novelty and originality of the work as well as the state-of-the-art approaches on handling of HDF5 datasets; particularly, information on similar existing tools is missing.

The paper is well-structured and easy to follow. Links in the paper pointing to the software as well as the example are functional. The use of the xarray package to read data from the HDF5 file together with its metadata is a wise choice. On the other hand, the example provided in section 3.3 does not clearly convey the benefits of using metadata/attributes. From figure 5, I assumed that the units of measurement are stored as attributes, which are fetched by h5RDMtoolbox when creating the plot (this assumption was confirmed to be true while going through the tutorial). The authors could improve the figure as well as its textual explanation. In this case, it would be helpful to show a snippet of the yaml file as well.

The article could summarize which aspects of FAIR the proposed solution intends to improve and how. This information is scattered throughout the paper and it would be beneficial to bring it together, for example, in the form of a table. The article could also contain a comparison of the functionalities of similar tools with h5RDMtoolbox. As h5RDMtoolbox extends the h5py Python library, the authors could provide a statement on whether there are any differences in performance when executing similar operations with h5RDMtoolbox and h5py and whether the size of the HDF5 files affects the performance of their toolbox.

Some colors used in Figure 1 may be hard to distinguish for persons with difficulties recognizing red/green colors. To make the figure easily readable for everyone, the authors could consider switching to a color-vision-deficiency-friendly color palette which can be defined using online tools.

Furthermore, I strongly recommend that the authors recheck the reference list. Some items are incomplete, contain duplicate information, are wrongly formatted or contain broken links. It is also suggested to add the DOI of publications (or ISBN for books), wherever available.


Tool and its documentation:

I agree with Nils Preuß that the authors have put a lot of effort into the development of the tool, which is evident from their clean and well-structured code as well as the documentation of the tool. The coding style is good and helps the reader easily understand the code. The code is sufficiently commented and tests are included. However, metadata on the software in a standardized format is missing.

The tutorial on getting started with the toolbox is functional and gives the users most of the basic information required to use the toolbox. In contrast, it is not easy to understand how to use the toolbox for an existing HDF5 file created in a different context (I was attempting to use the toolbox to add metadata to an HDF5 file containing multi-sensor data from a railway environment). The authors mention in the conclusion that they intend to test their toolbox with data from different scientific disciplines. However, the documentation does not provide sufficient guidance to the users on creating new conventions (yaml) and new validators. I strongly recommend that the authors create a dedicated section in the documentation for this topic. Alongside that, the documentation could list the validators that are already defined in their toolbox and explain the validation they perform. Furthermore, the use of vocabularies plays a great role in making the datasets interoperable. It would be worthwhile if the authors commented on the possible use of vocabularies in their solution and its need/role in sustainable development and reuse of their toolbox.

I also came across many spelling and grammatical mistakes in the documentation. The flow of contents in the documentation could be improved. Furthermore, the link in the readme file pointing to the documentation and examples is not working.

System requirements are missing in the documentation of the tool. If there are no specific system requirements, this can be explicitly mentioned. Also, it is unclear whether the tool was tested on different operating systems. The authors can make a statement on how the tool works on different operating systems.


Comment on authorship:

The list of authors of the manuscript is not consistent with the list of developers of the software. In GitHub, there are contributions from a person with username ‘lucasbuettner’. This person is neither listed as an author of the paper nor mentioned in the acknowledgements section. The authors could add to GitHub a list of current and past contributors to the project. A conflict of interest statement from the authors is also missing in the paper.

Invited Review Comment #67 Anonymous @ 2023-11-12 08:58

The authors present a Python package that provides a high-level interface for working with HDF5 files and validating metadata according to custom metadata conventions. The interface to HDF5 files is a thin wrapper around the h5py package and exposes a similar API, with one difference being that data arrays of the xarray package are returned when querying datasets. Beyond that, a query mechanism to search and filter datasets in individual HDF5 files, the local filesystem, or MongoDB database instances is provided. Regarding the above-mentioned metadata conventions, the authors introduced a file layout in yaml syntax that allows such conventions to be defined in a human- and machine-readable format. A convention can be loaded into the package, which will then include metadata validation steps against the loaded convention in subsequent I/O operations from or to HDF5 files with the package.
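To make the validation idea above concrete, here is a minimal sketch in plain Python. It does not use the h5RDMtoolbox or h5py API: `AttributeRule`, `validate_attributes`, `CONVENTION` and `create_dataset` are hypothetical names chosen for illustration, and a plain dictionary stands in for an HDF5 file. It only shows how a loaded convention could reject a dataset whose attributes do not satisfy the rules.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class AttributeRule:
    """One hypothetical convention entry: an attribute name plus a validator."""
    name: str
    validator: Callable[[Any], bool]
    required: bool = True


def validate_attributes(attrs: Dict[str, Any], rules: List[AttributeRule]) -> List[str]:
    """Return a list of problems; an empty list means the attributes are valid."""
    problems = []
    for rule in rules:
        if rule.name not in attrs:
            if rule.required:
                problems.append(f"missing required attribute '{rule.name}'")
            continue
        if not rule.validator(attrs[rule.name]):
            problems.append(f"invalid value for attribute '{rule.name}'")
    return problems


# Hypothetical convention: 'units' is mandatory, 'long_name' is optional.
CONVENTION = [
    AttributeRule("units", lambda v: isinstance(v, str) and v.strip() != ""),
    AttributeRule("long_name", lambda v: isinstance(v, str), required=False),
]


def create_dataset(store: Dict[str, Any], name: str, data: Any, **attrs: Any) -> Dict[str, Any]:
    """Store a dataset only if its attributes pass the convention check."""
    problems = validate_attributes(attrs, CONVENTION)
    if problems:
        raise ValueError("; ".join(problems))
    store[name] = {"data": data, "attrs": attrs}
    return store[name]
```

In this sketch, `create_dataset(store, "u", [1.0, 2.0], units="m/s")` succeeds, while omitting `units` raises a `ValueError` before anything is written.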

I believe this software is a valuable contribution to the scientific community and could be useful for researchers in various disciplines. Moreover, it fits well into the scope of the Ing.Grid Journal. However, before I can fully recommend this manuscript for submission, I would be grateful if the authors could address or respond to the following comments:

1. Unfortunately, attempting to run the quick start example was not without issues. Installation via pip fails with python3.11, I guess this is because the setup.cfg of the repository requires python < 3.11. However, the README only states “tested until 3.10”. Is python<3.11 a hard requirement or can it be loosened? If yes, I suggest to do so as there is already a python3.12 stable release. If not, the readme should probably be rephrased.

2. There also seems to be an issue with python3.10 on macOS - I wasn’t able to reproduce the error on a machine that runs on Ubuntu22.04. On macOS with python3.10, the quick start example raised an exception when loading the convention. After downgrading to Python3.9 it worked. If this can be confirmed, maybe it would be good to state this in the README such that users are not surprised when the quick start example does not work. By the way, I got the same error in the provided Google Colab notebook.

3. I think both the manuscript as well as the online documentation could use some language editing. I stumbled across a number of typos in the online documentation, as well as a few in the paper (I will list some of them at the end of this comment).

4. Two more things regarding the online documentation: (1) the link in the README in the first sentence of the “Documentation” section yields a 404, and (2) on the following page on the database features, there is a large code cell containing the traceback of a keyboard interrupt exception (https://h5rdmtoolbox.readthedocs.io/en/latest/database/h5mongo.html#first-things-first-connection-to-the-db). Is this intentional?

5. In several places in the online documentation, it seemed to me that there are escape characters that shouldn’t be rendered. For instance, here: https://h5rdmtoolbox.readthedocs.io/en/latest/database/Serverless.html#advances-dataset-searches, above the first code box. Shouldn’t this be simply “$eq” and “$gt”?

6. The README of the repository states that the toolbox supports “(4) sharing data,.. e.g. to repositories like Zenodo”. I was scanning the API documentation but I couldn’t find anything related to this. Does the package provide an interface to automatically create a Zenodo dataset from a local file system or MongoDB instance? If not, I suggest that this part of the README is rephrased. The same holds for “(5) reusing data”. I could only see how conventions are “reused” from Zenodo, but not entire datasets. But besides this, I am not sure if such a thing should actually be part of the responsibilities of this package.

7. I suggest to add some more context to the part of the manuscript where the conventions are first introduced (section 3.1). The way I understand it, when a convention is loaded with `h5tbx.use(cv)`, the metadata of subsequent I/O operations with the toolbox is validated against this convention. I think it would help to state something like this before jumping down to details like “target_method”. The way I understand it is that when choosing “target_method: create_dataset” on an attribute, one can (for instance) make an attribute mandatory for every dataset that is created? As said, I think some more explanations in what sense the loaded convention affects the API and how it relates to the terms in the yaml file could be useful here.

8. I am not so sure to what extent the package addresses the aspect of “sharing” as stated in the manuscript, in particular in Section 3.4 “Sharing and re-using”. I think the reusing aspect is fine, as the toolbox provides query and filter mechanisms that I can use to explore the dataset of someone else. But I think that “sharing”, in particular in connection to the FAIR principles that are referenced in the manuscript, would rather refer to making the data publicly accessible somewhere behind a persistent identifier, etc. Maybe it is just a wording issue and I am misinterpreting this.

9. It would be nice if there was a concise summary of what it is that the package provides. In the “conclusion and outlook”, there are the three bullet points, but maybe it could help if such an overview was placed at the beginning of Section 3? This way, maybe the reader is more prepared for what is to come in 3.1-3.4.

10. Just a minor comment/question regarding the implementation of the support for MongoDB instances. I was a bit surprised to see code like `h5["dataset"].mongo.insert(…)`. So the mongo accessor is part of the dataset interface? Wouldn’t it make sense to have a generic database interface that could be implemented for different backends (i.e. MongoDB, the local file system, or other database systems), and have that as a separate object and not part of the dataset interface? That way the dataset itself would be decoupled from the database, and one could provide backends for different database implementations without having to touch the dataset implementation (there could be a protocol on the API that a database implementation must fulfil to enable type checking).
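The decoupling suggested in point 10 could look roughly like the following sketch, which uses `typing.Protocol` from the standard library. This is not the toolbox's API: `MetadataBackend`, `InMemoryBackend` and `index_datasets` are hypothetical names, and a dictionary-based backend stands in for MongoDB or a file-system index.

```python
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class MetadataBackend(Protocol):
    """Hypothetical interface that every database backend would have to fulfil."""

    def insert(self, path: str, attrs: Dict[str, Any]) -> None: ...

    def find(self, query: Dict[str, Any]) -> List[str]: ...


class InMemoryBackend:
    """Toy backend; a MongoDB backend would expose the same two methods."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Any]] = {}

    def insert(self, path: str, attrs: Dict[str, Any]) -> None:
        self._records[path] = dict(attrs)

    def find(self, query: Dict[str, Any]) -> List[str]:
        # Return the paths whose attributes match every key/value pair in the query.
        return sorted(
            path
            for path, attrs in self._records.items()
            if all(attrs.get(key) == value for key, value in query.items())
        )


def index_datasets(backend: MetadataBackend, datasets: Dict[str, Dict[str, Any]]) -> None:
    """Push dataset metadata into any backend without the dataset knowing about it."""
    for path, attrs in datasets.items():
        backend.insert(path, attrs)
```

With this layout the dataset class needs no `mongo` accessor; a static type checker can verify that any object passed to `index_datasets` provides `insert` and `find`.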

A few of the language issues I stumbled across in the manuscript:

l. 97: “…, the toolbox requires a high-level interface…” -> the toolbox provides?

l.119: “It will fit most engineering data and proofs to be suitable to harmonize heterogeneous source data.” -> … proves to be suitable to harmonise data from heterogeneous sources?

l.291: “…, the performance of finding data is not compatible with a dedicated system”. I am not sure what the authors are trying to say here. Maybe “… the performance of finding data is not comparable to that of a database system”?

l.321: ... “extensive documentation automatically created and published…” -> IS automatically created…?

Invited Review Comment #64 Nils Preuß @ 2023-11-06 06:25

Overall assessment: revise

The authors present a Python tool called h5RDMtoolbox to help with the implementation and maintenance of FAIR research data management along the data lifecycle using HDF5 as the core file format.
The authors clearly made an effort to ensure code quality, as well as to provide extensive documentation, usage examples and tutorials. However, the manuscript / software descriptor would profit from a more careful presentation, especially a discussion of how the software compares to other similar and related alternative software in terms of scope and provided functionality, as well as the software architecture and usage. While the document is well formatted and stylized, the writing / language in general could be improved.
More detailed comments are provided below.

Detailed comments (c.f. https://www.inggrid.org/site/reviewerguidelines/):

Statement of need:
    The introduction focuses a lot on the challenges and importance of metadata and research data management in a very general sense. Section 2 elaborates on typical workflow steps using the concept of the research data life cycle, however also in a very general sense, without references to literature, and more importantly, without reference to typical workflows and challenges the target audience may encounter.

Discussion of how the software compares to other similar and related alternative software in terms of scope and provided functionality:
    The article would benefit from keeping the contents of the current introduction a lot more concise, and introducing a typical workflow including its current challenges and current solutions (without the presented tool) along the data lifecycle steps in section 2. This allows readers to understand the purpose and scope/limits of the tool, recognize the needs that the presented tool addresses much more clearly, as well as to compare it with the workflow that does use the tool presented in later sections.

Novelty of the approach:
    Clarification regarding the needs and/or current typical workflows including typical solutions or workarounds (as suggested above) would emphasise the novelty of the presented approach.
    The article would benefit from a paragraph in section 3 (methodology) contrasting the presented solution for metadata conventions with competing popular approaches based on semantic web technologies or schemas. At least some assessment of how the approach addresses selected aspects of the FAIR principles would be much appreciated.

Software architecture and usage:
    The article is very unclear as to how a user of the tool interfaces with the introduced metadata conventions, or how the metadata validators are defined or implemented, as well as what their scope and limitations are.
    Which workflow steps / user interface features are metadata aware, and in which way? My best guess is that the example dataset references HDF5 dimension scales (an HDF5 feature that allows automatic translation of dataset indices to numeric information, e.g. spatial or temporal, about the "grid" of the dataset), which are then described by standard metadata to provide information about physical quantities and units for plotting (or maybe consistency checks)?

Since the available documentation is very comprehensive and detailed already, the article could leverage this by addressing those open questions concisely where appropriate and pointing to more detailed explanations in the form of such supplementary material via a clear reference / link if needed.

Additional notes (c.f. https://www.inggrid.org/site/reviewerguidelines/)

Software Review Requirements:
    The software code is deposited to a repository. The code is open to be accessed, viewed, browsed and cloned anonymously.
    The repository contains a README file explaining installation procedure, dependencies as well as purpose, scope and usage. The installation procedure is automated via pip.   
    Instructions on how to contribute to the project are included in the repository. They reference instructions regarding docstrings, docs and tests.
    The software submission does not depend on proprietary software.
    Installation instructions clearly and comprehensively detail all required dependencies, including required version numbers. Python specific standard automated dependency management tooling is used.
    The software code complies with formal standards regarding commenting, formatting conventions, coding standards and structure. It is easily readable and makes use of community or language-accepted common code structures.
    The repository uses version control.
    The repository allows issue tracking and submission of issues.
    The software adopts automated tests.

    A working example of the software is included via documentation of the software repository. Required minimal example data (convention file) is provided via external data repositories (Zenodo); manual download and adaptation of the software example are necessary due to changes in Zenodo's API (zsearch.search_doi() behaviour is affected by this).
    The software is NOT described by metadata using a formal, accessible, shared, and broadly applicable language for knowledge representation (cf. https://codemeta.github.io/). Some metadata is provided, however, via the setup.cfg file.
    The software is licensed under an Open Source license (CC-BY). The repository contains a plain text licensing file. It includes copyright and licensing statements of third-party software (gitbutler?) that are legally bundled with the code in compliance with the conditions of those licenses. It does NOT, however, refer to the CC-BY license used for the submission. The GitHub page refers to an MIT license.

Software Descriptor Requirements:
    Version number of the code: Missing
    State of the art and novelty of the approach: See detailed comments.
    Software license: Open licenses are referenced, see above.
    Statement of need: Addressed on a very general level in Introduction, addressed somewhat in section 2.
    Purpose and scope/limits of the application: See detailed comments.
    Utility of the software: See detailed comments.
    User notes: See detailed comments.
    Sample data availability and reference to related datasets: Not that clear, embedded in quickstart notebook.
    Conflicts of interest: No Statement available.

    Code listings do not have captions, their line numbering conflicts with the line numbering of the overall document (maybe this is a preprint / template issue?).