Comments
Invited Review Comment #185 Anonymous @ 2024-11-03 14:00
Summary:
The authors use a concrete experimental system as an example to show how data obtained from experiments can be published in a FAIR manner.
FAIR stands for Findable, Accessible, Interoperable, Reusable; the FAIR principles are guidelines that scientists and engineers are urged to follow when they publish their research data to the community [1].
As the authors note, these guidelines do not prescribe *how* to implement FAIRness, and they propose a number of requirements on the (meta)data gathering and publication process that can help achieve this. They propose and implement a data collection framework and apply it to an experimental test environment where data are collected on physical objects using a variety of sensors.
The authors devise information models for the (meta)data describing the different elements in the experimental setup. They provide an implementation where instances of these models are stored in a publicly available GitLab repository, and they show some examples of how to use these models.
The work is rigorous and seems quite comprehensive; the models are detailed, and most of the relevant artefacts are freely accessible.
I do have some general comments and requests for clarification.
Regarding FAIRness.
The FAIR specification offers a number of guiding principles ([1], Box 2). The authors are not explicit about how these are implemented in the current work. Instead, they present their own list of requirements, some of which are definitely close to the FAIR principles, some of which seem to be extensions to the specific domain (in which, by the way, I am not an expert), and some of which seem to lack a precise motivation. It is left to the reader to make the connection. It would improve understanding and strengthen their claim that they are applying the FAIR principles if they made the links between the FAIR guidelines and their requirements explicit. This could include explicit statements on which FAIR principles have not been implemented exactly or completely.
Regarding the list of requirements:
R1, R2, R3 and R4 are integral to FAIR, but have not received explicit justification in the current text. It would be useful to give that here as well.
R7 seems a bit obvious and perhaps empty of meaning: models are supposed to contain relevant quantities.
R10: This statement is unclear.
R14: does this mean any experimental setup should be describable with the provided model?
R16: Do the authors mean that the model itself should allow for this expansion, or that one should be able to change the model itself easily (say add new classes, attributes, relations)?
R17: How is version control to be combined with persistent identifiers? Will there be a different PID for different versions of the same experimental results?
R19: why is this a requirement? It seems an implementation detail.
Information models, standardization, interoperability
Section 3 presents an overview of the State of the Art for making data FAIR. It contains an overview of some of the relevant technologies such as RDF and a background on ontologies and information models. The authors state that the difference between the latter two is not that well defined, and go on to produce three information models, for Components, Substances and Sensors, as RDF graphs. With all the attention given to ontologies in this section, it would be good if the authors could explain why they chose RDF instead of OWL.
The authors state in line 233: “…information models …, which are based on existing ontologies.” It is unclear whether the authors mean to imply that the information models “just” use existing (RDF) vocabularies for annotating certain elements, say as rdf:type, or that they intend to use the class structure of existing ontologies. To that end it might be useful if the authors could also provide the RDF graphs representing the models in a machine-readable format (ttl, json, xml).
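To illustrate the first of these two readings, the following minimal Python sketch types a single instance with a class from an existing vocabulary and serializes it as N-Triples. The instance IRI and the label are invented for illustration and are not taken from the paper; only the `rdf:type`, `rdfs:label` and `sosa:Sensor` IRIs are real vocabulary terms. It is not meant to reflect the authors' actual models.

```python
# Minimal sketch: annotate an instance with terms from existing vocabularies
# and serialize the result as N-Triples (one triple per line).
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
SOSA_SENSOR = "http://www.w3.org/ns/sosa/Sensor"

# Hypothetical instance IRI, invented for this example.
SENSOR = "https://example.org/resource/sensor-001"


def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Format a plain string literal, escaping backslashes and quotes only."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) terms that are already formatted."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


triples = [
    (iri(SENSOR), iri(RDF_TYPE), iri(SOSA_SENSOR)),
    (iri(SENSOR), iri(RDFS_LABEL), literal("Hypothetical force sensor")),
]
document = to_ntriples(triples)
print(document)
```

In this reading the model only borrows terms; nothing in the graph itself commits to the class hierarchy or axioms of the ontology the terms come from.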
Currently the models are only available as diagrams in the paper. These diagrams are not uniform. The diagram for Components and one of the two diagrams for Sensors have some subdivisions which are missing from the diagram for Substances. The latter seems more comprehensive and similar to the model for Sensors in the appendix. Maybe all three models only need to be presented in that way, especially since the assignment to these subdivisions seems somewhat inconsistent between the Components and Sensors models; e.g. UUID is assigned to METADATA in one and to PROPERTIES in the other.
Various instances of two of the models (Sensors and Substances) are available as RDF files in a GitLab repository, which is very useful. It would be nice if examples of the third model, for Components, were available as well.
The authors make use of standardized vocabularies but add details to describe the specific environment they aim to model. It is generally understood that true interoperability between different data sets needs them to be described using standardized, “global models” (or schemas), at least within a common scientific domain. This should allow users to write code that can deal with the format and contents of each data set, facilitating comparison, joining etc. One could argue that FAIR guideline R1.3 “(meta)data meet domain-relevant community standards” hints at this. Do the authors think that their models are sufficiently standardized to enable this type of interoperability?
Is there an existing effort to provide such standardized models for the domain that is relevant to the current paper?
Reference data
For the description of the Substances the authors state that some of the model elements might lead to very large files if serialized as RDF graphs in some of the standard formats (ttl, json, xml). Instead, they propose to use HDF5 files for storing this data and refer to this file from within the main metadata files.
Is there an accepted standard for linking to (elements inside) HDF5 files, or is this ad hoc? Should the HDF5 file correspond to some standard format as well?
For the example the authors use and refer to in https://w3id.org/fst/resource/018dba9b-f067-7d3e-8a4d-d60cebd70a8a.ttl, it seems that the authors use data extracted from an external source. Would it not be better to give a reference to that source (NIST Chemistry WebBook) instead of using its contents in a separate copy, formatted in a bespoke manner?
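To make the question about linking into HDF5 files concrete, here is a minimal Python sketch of one possible ad hoc convention, in which the IRI names the file and the fragment carries the path of a dataset inside it. I do not claim that this is an accepted standard or that it is what the authors do; the reference and the dataset path are hypothetical.

```python
# Minimal sketch of an ad hoc convention: the IRI names the HDF5 file and the
# fragment carries the absolute path of a dataset inside that file.
from urllib.parse import urldefrag


def split_hdf5_reference(reference):
    """Return (file_iri, internal_path) for a reference of the assumed form."""
    file_iri, fragment = urldefrag(reference)
    if not fragment.startswith("/"):
        raise ValueError("expected an absolute HDF5 path in the fragment")
    return file_iri, fragment


# Hypothetical reference; neither the file nor the dataset path is from the paper.
reference = "https://example.org/data/substance.h5#/density/values"
file_iri, internal_path = split_hdf5_reference(reference)
print(file_iri)
print(internal_path)
```

A consumer would still need to know that the fragment is to be read as an HDF5 path, which is exactly the kind of agreement a standard would have to provide.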
Implementation
The authors have implemented the framework using GitLab as the data store. They have added a separate mechanism for providing persistent identifiers using two levels of redirection. This seems an interesting and valid approach. One thing that seems to be missing is FAIR guideline F4: “(meta)data are registered or indexed in a searchable resource”. Of course, the interpretation of that guideline might be up for discussion, but are the authors planning to provide a semantic query interface into the contents of the metadata objects that could allow users to search for data sets of interest?
The authors provide links to code for various parts of the implementation, but sadly these did not resolve to accessible web locations. I identify some of these locations below. I think this code may be useful for interpreting the description of the CI/CD pipeline, which may be somewhat obscure without it and requires quite some understanding of GitLab Pages, CI/CD support, (HTML) redirects etc.
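As I understand the two-level redirection, it can be sketched roughly as follows. This is my own reading and not the authors' implementation: both lookup tables and all URLs in them are hypothetical, and real redirects would of course be served over HTTP rather than looked up in dictionaries.

```python
# Minimal sketch of PID resolution through two levels of redirection.
# Both tables and all URLs are hypothetical.

# Level 1: the persistent identifier service points at a stable landing URL.
PID_SERVICE = {
    "https://w3id.org/example/resource/0001":
        "https://pages.example.org/pid/0001",
}

# Level 2: the landing URL points at the current location of the file.
LANDING_PAGES = {
    "https://pages.example.org/pid/0001":
        "https://gitlab.example.org/group/repo/-/raw/main/0001/rdf.ttl",
}


def resolve(pid, tables=(PID_SERVICE, LANDING_PAGES)):
    """Follow one redirect per table and return the final location."""
    location = pid
    for table in tables:
        if location not in table:
            raise KeyError(f"no redirect registered for {location}")
        location = table[location]
    return location


final_location = resolve("https://w3id.org/example/resource/0001")
print(final_location)
```

The benefit of the second level, in this reading, is that moving the files only requires changing the second table, while the published identifier and the first table stay untouched.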
Applications
I like application 1. It gives a very nice example of the use of an IRI as a QR code on a physical object (a sensor).
I find application 2 somewhat confusing, in particular the introduction of a completely new data structure for a measurement in MATLAB. I do agree with the authors that it would be good if graphs of the experiment could be provided. Would that necessitate a new model (for Measurements)?
Application 3 is rather generic and can be left out.
Conclusion
I like the work and I think that with some minor revisions it can be accepted.
I hope that other researchers will be inspired to perform a similarly detailed publication of their research data. And it would be of great interest if others would engage with the framework presented here, for example by trying to use the same models for describing their (meta)data. This could lead to feedback on the models and possibly their evolution towards a domain-specific standard.
Minor comments and questions referring to specific locations in the paper:
L17,18: This sentence is a bit wonky and I do not quite understand what is intended. Maybe rewrite.
Figure 1: this figure is very unclear and could be any graph, as the text is hardly readable. I have been able to read some of the RDF graphs presented in the paper into Python and visualize them using an interactive graph viewer, but it may be hard to improve the layout within a static page. As the authors also link to a visualizer (https://issemantic.net/rdf-visualizer), they could consider leaving the figure out.
L49,50: This sentence is not well formed. Likely remove one of the two occurrences of “are investigated”.
L168: the authors may want to link to the RDF syntax specification:
https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/
L173: IRIs are identifiers of, not necessarily links to, information.
L197: “marker” should be “maker”
L249: “When linking, only known semantic vocabulary from established ontologies is used.” should probably be “When linking, only known semantic vocabularies from established ontologies are used.”
L250: “… ontologies are summarised in the following table with its reference and its application…” should probably be “… ontologies are summarised in the following table with their reference and their application…”
Table 2: instead of ‘prefix’ maybe use ‘URI’ in the header of the second column? If these are supposed to be hyperlinks, the entries for dcTerms, dcType and foaf did not resolve to a valid URL.
L300: is the boldface and capitalization of Origin intentional? A similar phrase is used in line 273 without boldface. Here it seems to be given a special meaning which is not explained.
L324/5: the link does not resolve to an existing page at the time of this review.
L372: the link in footnote 7 does not resolve to an existing page at the time of this review.
L380: “…necessitates updates…” should probably be “… would necessitate updates …”.
L425: typo “beeing” (should be “being”)
L430: the link in footnote 9 does not resolve to an existing page at the time of this review.
L504: the links in the URL column of Table 3 do not resolve to an existing page at the time of this review.
[1] M. D. Wilkinson, M. Dumontier, I. J. J. Aalbersberg, et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Scientific Data, vol. 3, p. 160018, 2016. DOI: 10.1038/sdata.2016.18.
Invited Review Comment #181 Anonymous @ 2024-10-25 17:27
Review of “How to Make Bespoke Experiments FAIR”
The premise of the article is that many scientific workflows are bespoke. The authors list several use cases. In the case of their tensile test rig, they wish to capture information from three buckets: (1) sensors, which include instruments and the various detectors within, (2) components, which are auxiliary physical objects that add value to the experiment, and (3) substances, which could include the specimen itself and also the surrounding media relevant to the experiment. They propose schemas for organizing the three information buckets, talk a little bit about RDF and ontologies, and then finally unveil their information model based on RDF concepts.
This is a very charming manuscript. The authors cover much ground, from tracking ambient experimental conditions, to samples, to the propagation of uncertainties. This is ambitious work and it is very much worthy of publication. However, as with any ambitious work, one might expect many criticisms. The questions and comments I have raised are located on the PDF itself. Because they are extensive, it would require significant effort for me to transcribe them into plain text. I have included a link here where you may retrieve the PDF with comments. While I don’t mind sharing all my comments with the authors, the link here is not anonymous, so I would like the editor to please download the article and then share it through your own mechanism.
In any case, a revised manuscript, assuming all of my concerns are addressed, will be a highly valuable contribution to the engineering community.