

How to Make Bespoke Experiments FAIR: Modular Dynamic Semantic Digital Twin and Open Source Information Infrastructure

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Manuel Rexer, Nils Preuß, Sebastian Neumeier, Peter F. Pelz

Abstract

In this study, we apply the FAIR principles to enhance data management within a modular test environment. By focusing on experimental data collected with various measuring equipment, we develop and implement tailored information models of physical objects used in the experiments. These models are based on the Resource Description Framework (RDF) and ontologies. Our objectives are to improve data searchability and usability, ensure data traceability, and facilitate comparisons across studies. The practical application of these models results in semantically enriched, detailed digital representations of physical objects, demonstrating significant advancements in data processing efficiency and metadata management reliability. By integrating persistent identifiers to link real-world and digital descriptions, along with standardized vocabularies, we address challenges related to data interoperability and reusability in scientific research. This paper highlights the benefits of adopting FAIR principles and RDF for linked data, proposing potential expansions for broader experimental applications. Our approach aims to accelerate innovation and enhance the scientific community’s ability to manage complex datasets effectively.

Comments

Comment #195 Manuel Rexer @ 2025-01-20 05:53

Dear Editor,

we are pleased to resubmit our revised manuscript, “How to Make Bespoke Experiments FAIR: Modular Dynamic Semantic Digital Twin and Open Source Information Infrastructure”, for consideration in ing.grid. We are deeply grateful for the constructive feedback provided by the reviewers, which has greatly helped us improve the quality of our work.
In response to the reviewers’ comments and questions, we have carefully addressed each point in detail. Please find our comprehensive replies to the reviewers’ comments attached along with the revised manuscript.
We appreciate your time and effort in overseeing the review process and hope that the revisions meet your and the reviewers' expectations. Please do not hesitate to contact us if additional clarifications are needed.
Thank you again for the opportunity to improve and resubmit our work.

Sincerely,
Manuel Rexer, Nils Preuß, Sebastian Neumeier and Peter Pelz


Replies to Review #1

Thank you for your detailed and helpful comments and questions. All minor revisions are integrated in our manuscript. We also updated and published all links to our software repositories, even though some of them still need refactoring. In addition, please find detailed answers to your questions and comments below.

Regarding FAIRness.
The FAIR specification offers a number of guiding principles ([1], Box 2). The authors are not explicit how these are implemented in the current work. Instead, they present their own list of requirements, some of which are definitely close to the FAIR principles, some of which seem extensions to the specific domain (of which btw I am not an expert), some seem lacking in precise motivation. But it is up to the reader to make these links. It would improve the understanding and strengthen their claim that they are applying the FAIR principles if they made the links between FAIR guidelines and their requirements explicit. This could include some explicit statements in case some of the FAIR principles may not have been implemented exactly or completely.
Re: Thank you very much for your feedback. Explicitly implementing or discussing the individual sub-items of the FAIR principles was not our goal in this publication. We have added a section to clarify why we did not address or discuss those sub-items in more detail.
Regarding the list of requirements:
R1,2,3, 4 are integral to FAIR, but in the current text have not received explicit justification. It would be useful to give that here as well.
Re: See above.
R7 seems a bit obvious and maybe empty of meaning, models are supposed to contain relevant quantities.
Re: Thank you for your feedback. We wanted a model that can be used in data acquisition software as well as in postprocessing, e.g. for sensor characteristics. Therefore, it must contain the quantities relevant to the experiment and not only the relevant metadata.
R10 this statement is unclear.
Re: Thank you for your comment; we revised the requirement.
R14: does this mean any experimental setup should be describable with the provided model?
Re: Thank you for this question. Our intention was to have a universal approach, but of course we had the experiments conducted in our labs in mind, which are mostly mechanical setups. We have rewritten the point to make it easier to understand.
R16 Do the authors mean that the model itself should allow for this expansion, or that one should be able to change the model itself easily (say add new classes, attributes, relations)?
Re: Thank you for your question. We aimed to build information models that allow for this expansion themselves and added a section that describes this more closely. We also did not include more background information about the information models in the paper because they are still work in progress and might change. We have only presented the models we used for our first tests of connected data. More detailed information about the information models and further topics will follow in possible future publications: for example, how the information models are constructed, how they depend on each other, and how we will test data sets against SHACL profiles to check whether they conform to those information models.
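To make the mention of SHACL profiles concrete, a conformance check of that kind might look like the following minimal shape. The `ex:` namespace, target class, and property choice are purely illustrative, not taken from the actual FST information models:

```turtle
@prefix sh:      <http://www.w3.org/ns/shacl#> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix ex:      <http://example.org/shapes/> .

# Hypothetical shape: every instance of ex:Sensor must carry exactly one
# dcterms:identifier literal (e.g. the UUID used in its PID).
ex:SensorShape
    a sh:NodeShape ;
    sh:targetClass ex:Sensor ;
    sh:property [
        sh:path dcterms:identifier ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
```

A validator such as pySHACL could then report which datasets violate such a profile.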
R17 How is version control to be combined with persistent identifiers? Will there be a different PID for different versions of the same experimental results?
Re: Thank you for being attentive, alert to detail and asking this question. Currently the version control is not combined with the persistent identifiers, but that is something we want to do in the near future. We already have at least two very rough ideas for how we could achieve this but have not evaluated them yet. Therefore, we sadly cannot give you more information on that topic for now and look forward to addressing it in a future publication.
R19: why is this a requirement? It seems an implementation detail.
Re: Thank you for your question. From our experience it is essential to collect as much data/metadata automatically as possible. We tried to explain this in L148-L1351 (revised version).
Information models, standardization, interoperability
Section 3 presents an overview of the State of the Art for making data FAIR. It contains an overview of some of the relevant technologies such as RDF and a background on ontologies and information models. The authors state that the difference between the latter two is not that well defined, and go on to produce three information models, for Components, Substances and Sensors, as RDF graphs. With all the attention given to ontologies in this section it would be good if the authors could explain why they chose RDF instead of OWL.
Re: Thank you for your feedback. We added a small section at the end of the ‘3 State of the Art and Relevant Standards’ chapter on why we have chosen plain RDF over RDFS, SKOS or OWL.
The authors state in line 233: “…information models …, which are based on existing ontologies.” It is unclear whether the authors mean to imply that the information models “just” use existing (RDF) vocabularies for annotating certain elements, say as rdf:type, or that they intend to use the class structure of existing ontologies. To that end it might be useful if the authors could provide the RDF graphs representing the models also in a machine-readable format (ttl, json, xml).
Re: Thank you for pointing this out. We intend to do both and revised the section to try to clear things up.
Currently the models are only available as diagrams in the paper. These diagrams are not uniform. The diagram for Components and one of the two diagrams for Sensors have some subdivisions which are missing from the model for Substances. The latter seems more comprehensive and similar to the model for Sensors in the appendix. Maybe all three models only need to be presented in that way, especially since the assignment to these subdivisions seems somewhat inconsistent between the Components and Sensors models, e.g. UUID assigned to METADATA in one, PROPERTIES in the other.
Re: Thank you for this suggestion. The diagrams are intended to elucidate the underlying principles of the various information models. Given that these models are distinct, there is a lack of consistency in the terminology employed. To address this, an additional paragraph has been included to clarify that there are three distinct models (L258-L260, revised version).
Various instances of two of the models (Sensors and Substances) are available as RDF files in a GitLab repository, which is very useful. It would be nice if also examples of the third model for Components were available as well.
Re: Thank you for your comment. Due to legal restrictions, we are unfortunately not able to provide all our component models. Nevertheless, to improve the understandability we published an information model of a self-developed component and added the link in the paper (L286-L289, revised version).
The authors make use of standardized vocabularies but add details to describe the specific environment they aim to model. It is generally understood that true interoperability between different data sets needs them to be described using standardized, “global models” (or schemas), at least within a common scientific domain. This should allow users to write code that can deal with the format and contents of each data set, facilitating comparison, joining etc. One could argue that FAIR guideline R1.3 “(meta)data meet domain-relevant community standards” hints at this. Do the authors think that their models are sufficiently standardized to enable this type of interoperability?
Re: Thank you for your question. It is not intended to develop a standard, even though it would be possible. Currently, activities are ongoing to roll out the approach to several other test rigs in the institution and thereby further develop and improve the models.
Does there exist an effort to provide such standardized models for the domain that is relevant to the current paper?
Re: No, we are currently not working on standardized models.
Reference data
For the description of the Substances the authors state that some of the model elements might lead to very large files if serialized as RDF graphs in some of the standard formats (ttl, json, xml). Instead, they propose to use HDF5 files for storing this data and refer to this file from within the main metadata files.
Is there an accepted standard for linking to (elements inside) HDF5 files, or is this ad hoc?
Re: Thank you very much for the question. It is currently not standardized but might be the subject of future efforts by the scientific community.
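For illustration, one ad-hoc convention (it is not a standard; the predicate `fst:lookupTable` and the dataset path below are invented for this sketch) would be to reuse an IRI fragment to name the group/dataset inside the HDF5 file:

```turtle
@prefix fst: <https://w3id.org/fst/ontology/> .   # hypothetical prefix

# Hypothetical link from a substance description to a lookup table
# stored inside an accompanying HDF5 file; the part after '#' names
# the internal HDF5 group/dataset path.
<https://w3id.org/fst/resource/018dba9b-f067-7d3e-8a4d-d60cebd70a8a>
    fst:lookupTable <lookup_tables.h5#/saturation/pressure> .
```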
Should the HDF5 file correspond to some standard format as well?
Re: Also thank you for this follow-up question. In short: it could be. The main issue with the HDF5 files inside a git repository is that every time someone needs data from the file, the whole file needs to be downloaded. One of the strengths of the HDF5 format is its support for partial loading, which we would also like a web service to be capable of, so that small portions of a ‘file’ can be loaded without downloading the complete file.
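The partial-loading strength mentioned in this reply can be sketched with `h5py` (file name and dataset name are arbitrary; this assumes the `h5py` package is installed):

```python
import os
import tempfile

import h5py

path = os.path.join(tempfile.mkdtemp(), "substance_lookup.h5")

# Write a large lookup table once ...
with h5py.File(path, "w") as f:
    f.create_dataset("saturation_pressure", data=list(range(100_000)))

# ... and later read only a small slice. h5py seeks inside the file,
# so the full 100k values are never loaded into memory.
with h5py.File(path, "r") as f:
    chunk = f["saturation_pressure"][10:20]

print(chunk.tolist())  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

A web service with the same slicing semantics would let clients fetch such a range over HTTP instead of downloading the whole file.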
And for the example the authors use and refer to in https://w3id.org/fst/resource/018dba9b-f067-7d3e-8a4d-d60cebd70a8a.ttl, it seems that the authors use data extracted from an external source. Would it not be better to give a reference to that source (NIST Chemistry WebBook) instead of using its contents in a separate copy, formatted in a bespoke manner?
Re: Thank you for asking that question. Yes, we also think that pointing to a persistent web location for the data would be the better approach, but there is no obvious web API of the NIST Chemistry WebBook we could have used. It would also be highly questionable whether that web API would stay persistent. A solution could be an RDF-compatible web service for HDF5 files, as already mentioned and briefly described in the answers to the last two questions.
Implementation
The authors have implemented the framework using GitLab as the data store. They have added a separate mechanism for providing persistent identifiers using two levels of redirection. This seems an interesting and valid approach. One thing that seems to be missing is FAIR guideline F4: “(meta)data are registered or indexed in a searchable resource”. Of course, the interpretation of that guideline might be up for discussion, but are the authors planning to provide a semantic query interface into the contents of the metadata objects that could allow users to search for data sets of interest?
Re: Thank you for this question. Those are possible future improvements. At the current state we won’t provide a SPARQL endpoint or something similar for the datasets. The goal of this publication was to build an infrastructure without any self-hosted services (provided a GitLab Runner is somehow accessible) to make the datasets persistently and publicly available, not to index them in a database.
The authors provide links to code for various parts of the implementation, but sadly these did not resolve to accessible web locations. I identify some of these locations below. I think this code may be useful for interpreting the description of the CI/CD pipeline, which may be somewhat obscure without it and require quite some understanding of GitLab Pages and CI/CD support, (HTML) redirects, etc.
Re: Thank you for your comment. We worked on that and made everything publicly available, but not all of the code is perfectly commented, refactored and documented yet because of overlapping deadlines and the resulting lack of time.
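Since the two-level redirection is easiest to grasp with a concrete artifact, here is a hedged sketch of the kind of static redirect page a CI job could emit for one PID. The function name, HTML template, and target URL layout are invented for illustration, not the authors' actual pipeline code:

```python
# Sketch: generate a static HTML page (suitable for GitLab Pages) that
# redirects a PID path such as /fst/resource/<uuid> to the raw RDF file
# in a repository. The target URL pattern is illustrative only.

REDIRECT_TEMPLATE = """<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="refresh" content="0; url={target}">
    <link rel="canonical" href="{target}">
  </head>
  <body><a href="{target}">Redirecting...</a></body>
</html>
"""

def redirect_page(uuid: str, repo_base: str) -> str:
    """Build the redirect page for one UUID-identified resource."""
    target = f"{repo_base}/{uuid}.ttl"
    return REDIRECT_TEMPLATE.format(target=target)

page = redirect_page(
    "018dba9b-f067-7d3e-8a4d-d60cebd70a8a",
    "https://gitlab.example.org/fst/resources/-/raw/main",
)
print(page)
```

A CI job would write one such page per resource; w3id.org then provides the first, stable level of redirection on top of these pages.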
Applications
I find application 2 somewhat confusing. In particular the introduction of a completely new data structure for a measurement in MATLAB. Indeed I agree with the authors that it would be good if graphs of the experiment could be provided. Would that necessitate a new model (for Measurements)?
Re: Thank you for your question. We are working on that, and this will include a new model for the contents of the measurement files itself.
Application 3 is rather generic and can be left out.
Re: Thank you for your feedback. Although this example may not seem like a big deal, it is very relevant for our investigations to be able to use relevant and different kinds of uncertainty information. Therefore, we will keep this example but have revised this application to point out the improvements in uncertainty handling.
Minor comments and questions ref specific locations in the paper:
Figure 1: this figure is very unclear and could be any graph, as the text is hardly readable. I have been able to read some of the RDF graphs presented in the paper into Python and visualize them using an interactive graph viewer, but it may be hard to improve the layout within a static page. As they also link to a visualizer (https://issemantic.net/rdf-visualizer), the authors could consider leaving it out.
Re: Thank you for your comment. This figure is only intended to give readers a visual impression of an information model.
Table 2: instead of ‘prefix’, maybe use ‘URI’ in the header of the second column? If these are supposed to be hyperlinks, the entries for dcTerms, dcType and foaf did not resolve to a valid URL.
Re: Thank you for your feedback. In our case the links resolve/redirect to valid websites.
We have chosen ‘prefix’ as name since it aligns with the terminology that is used by the RDF Turtle (.ttl) files and would therefore keep it. But since we also understand that URI would be a suitable candidate, we went with ‘URI prefix’.

Replies to review #2

Thank you for your detailed and helpful comments and questions. All minor revisions are integrated in our manuscript. Please find in addition detailed answers to your questions and comments below.
Detailed Review:

L89: In reviewing this list R1-R6, this is starting to look like a schema. I would be curious if the authors have the desire to develop this into a full schema. Another schema that is worth referencing is the PIDINST schema: https://github.com/rdawg-pidinst/schema. There are many overlaps between sensors and instruments, and if it is possible, perhaps sensors can simply use the PIDINST schema.
Re: Thank you for your comment, we added an answer regarding PIDINST at your second comment.
A second note about R1 through R6 is that (in my humble opinion) R1-R3 and R6 are "musts", whereas R4-R5 are "shoulds". My comment on R4 gives an example of this reasoning.
Re: Thank you for your comment. We carefully double checked all requirements regarding “must” / “should” and revised our manuscript accordingly.
L93 (R4): When I read "machine actionable", I read it to mean that a computer can receive information from a sensor, and that it can compel a sensor to do something. I can understand receiving information from a sensor. I don't think compelling a sensor to do something should be part of the information model. Many sensors are passive (I'm thinking of a thermocouple, where you are reading a resistance). Should you compel it to do more than read resistance? I would argue not.
Therefore, I think clarity of language here is important. While I agree with "machine readable" as the baseline standard, I think "machine enforceable" is a stretch.
From another perspective, a simple glass thermometer is an instrument, and it is neither machine readable nor machine enforceable. While you may argue that a glass thermometer is not terribly accurate, it would be silly to require it to be machine actionable. So perhaps the wording here for R4 should take into account that you can't functionally do much about analog sensors that are nonetheless useful?
Re: Thank you for your feedback. You’re correct in noting that our definition of "machine actionable" might be different from your interpretation. In our context, "machine actionable" means that the data provided by the information model should be interpretable and usable by machines, allowing for automated processes or analysis. This does not necessarily imply control over the sensor itself, but rather the ability for machines to interact with the data represented by the model. We added a footnote to R4 to ensure clarity.
L98: I disagree strongly that components and substance should be part of the "sensor information model". I think components and substances should have their own model(s), and then you emphasize how the models need to interoperate with each other.
Some years ago we tried the uni-model with electron microscopes and the specimen that we examine with them. In the end we failed because we could not come to consensus on what a unified schema might look like for the most obvious electron microscope use case - which was just looking at a specimen with a microscope. This was because "obvious" meant something completely different, depending on the specimen.
If this is not the intended reading of R7-R10, then I think further clarification is needed.
Re: Thank you for your comment. We also disagree and therefore designed three different information models, whereas the underlying logic and the methods for providing and linking the information remain on the same basis. For clarification we rewrote the requirements and stated to which information model each belongs.
L103: And here, that ambient condition could be the humble glass thermometer, which you are not going to exclude just because it isn't machine actionable.
Re: Thank you for your comment. Since this concerns the experimental setup and the experiment itself, which are not part of the information model, it is not relevant to the information model. We can easily instantiate an information model for an analog sensor (like a glass thermometer), and the information model itself is again machine actionable. But from our experience we think it is much better practice to measure conditions such as the temperature automatically, to get more reliable and comparable time series data.
Especially when, like in our case, very modular test rigs are deployed with a high number of changing sensors and components.
L112: There are international guidelines on how uncertainties are to be reported. For example, BIPM's "Guide to the expression of uncertainty in measurement". I think it would be useful to point this out as best-practices for sensor vendors.
Re: Thank you for the comment. We agree and revised therefore the whole paragraph.
L129: While I agree with the spirit of R14, I can think of at least 2 ways where there can be some trouble (and there are probably many more).
1. Should there be consensus on some controlled vocabulary? Different vendors will often say the same thing differently, and you will never get them to reconcile the differences.
2. What happens when there are conflicts in units? Sometimes you can do a straightforward translation (degrees Celsius to Kelvin). Other times you cannot (T to A/m).
Re: Thank you for this valuable comment. You are absolutely correct that different vendors often use varying terminologies and units, which can present challenges. To address this, we aim to make the information model independent of such variations by leveraging ontologies, such as QUDT for units. These ontologies provide systematic rules for unit transformations, which can either be handled within the model or by external software tools. The model does not require the use of standardized units like SI units. Similarly, for quantities, as long as the terms and units used by sensor manufacturers are represented within the adopted ontologies, we do not foresee any significant issues. This approach ensures flexibility while maintaining consistency and interoperability across different systems and vendors.
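The QUDT-style unit transformation mentioned here boils down to a multiplier/offset step through the SI coherent unit. A minimal sketch follows; the two table entries are hand-written for illustration, not queried from the QUDT vocabulary:

```python
# QUDT models each unit with a conversionMultiplier (and, for affine
# units such as degree Celsius, a conversionOffset) relative to the SI
# coherent unit. Conversion goes via the SI value.
UNITS = {
    # unit:   (multiplier, offset)  ->  si = value * multiplier + offset
    "DEG_C": (1.0, 273.15),   # degree Celsius -> kelvin
    "K":     (1.0, 0.0),      # kelvin -> kelvin
}

def convert(value: float, from_unit: str, to_unit: str) -> float:
    """Convert between two units of the same quantity kind via SI."""
    m1, o1 = UNITS[from_unit]
    m2, o2 = UNITS[to_unit]
    si = value * m1 + o1          # to the SI coherent unit
    return (si - o2) / m2         # from the SI coherent unit

print(convert(25.0, "DEG_C", "K"))  # 298.15
```

As the reviewer's second point anticipates, this only works within one quantity kind; QUDT assigns T and A/m to different quantity kinds, so no such conversion rule exists between them.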
L133: Do you mean access control to the sensor? Or access control to the sensor model? I can't imagine someone access-controlling my digital voltmeter. If you are referring to the sensor model, OK, I think you can do this with GitHub.
Re: Thank you for your comment. We indeed mean the information models of the sensor/component/substance and not the sensor itself. We clarified this in the requirement.
L138: I agree with the spirit of this question. However, I would like the authors to consider that not all digitally available information should be automatically collected. For example, if you are collecting biographical information on a blood serum sample, I would argue that automated collection should only be permissible if applicable privacy guardrails are already in place.
Re: Thank you for this thoughtful comment. We agree that this consideration is particularly relevant in fields like the life sciences and that personal information shouldn’t appear in open repositories. However, in the context of our experiments this approach remains applicable. Based on our experience, it is beneficial to retain all quantities and information, as we cannot predict which data might be needed for future investigations. Furthermore, through appropriate access control mechanisms, we can also manage sensitive data, such as confidential drawings of components, while ensuring security and controlled availability.
L291: Question: Who will update the table when there is a new substance?
Re: Thank you very much for this question. Every substance has multiple lookup tables which normally don’t change, since the data inside them describes substance properties. Every time a new substance needs to be created, it is created from the substance information model with a semi-automatic script that downloads the data from the NIST Chemistry WebBook website, creates the tables with the downloaded data, writes them into the HDF5 file, and creates the corresponding RDF files.
L293-294: For this to work as you envisioned, you will need your external resources (e.g., the CRC Handbook of Chemistry and Physics, the VDI Wärmeatlas, the NIST Chemistry WebBook) to all follow an agreed-upon HDF5 format that is mutually consistent and interoperable, no?
Re: Thank you for your question. If the data from the various sources were made available as HDF5 files, then theoretically yes. However, we are using this only as an interim solution, as there is currently no good RDF-compatible web service for HDF5 files. Once such a service exists, the data would need to follow some kind of ‘RDF API’ structure, which also has yet to be developed and probably presents a significant challenge.
L307-308: I appreciate very much this specific treatment of the different uncertainty classes. I think the treatment of uncertainty does not receive enough attention in information modeling, and I am delighted to see the thought that all of you have put into this.
Re: Thank you for this comment. We intensively use this information in our post processing investigations.
L313: While this sensor model has quite a bit more detail than the PIDINST schema, there is also a significant overlap between the two at the higher level. I am very concerned about the proliferation of schemas, because each schema represents many hours of labor from a group of very clever people. As an example, the PIDINST schema represents consensus after a decade of working through the different ideas of what an instrument is. A diverse group of people, from cellular biologists to beamline scientists, came together to form this schema, which is an official recommendation from the Research Data Alliance.
I was not personally involved in the development of the PIDINST schema, but I recognize the tremendous value that such an effort brings to the community. I would very much like to see clever folks such as yourselves, do a little bit of data reuse from the results of a larger community effort like this one. The I in Interoperability is truly the most difficult part of FAIR, as it requires not only technical expertise (which many of us are good at), but also social engagement (which I know I am not good at).
Re: Thank you for your thoughtful and detailed feedback. At this stage, we do not plan to establish any formal schemas, although we recognize that doing so would be a feasible and logical next step. Our immediate focus is on implementing our approach across additional test rigs to validate and refine its usability.
While we are not yet deeply familiar with the PIDINST schema, we truly appreciate the tremendous effort that has gone into its development, and we value your suggestion to explore it further. We also share your concerns about the proliferation of schemas and the potential fragmentation this may cause. Our intention is to base our models on widely recognized and established vocabularies to maximize interoperability and ease of interpretation.
We fully recognize that achieving true interoperability requires not only technical alignment but also broader community engagement to ensure shared understanding and adoption. We will take your suggestion seriously and explore how elements of the PIDINST schema might inform or align with our work moving forward.
L330-331: What is the succession plan for this Git repo?
Re: Thank you for this question. Currently there is still a lot of documentation left to do. Meanwhile, other use cases are being deployed and are in a late testing stage. There is currently no concrete plan yet for major further development, but there are still ideas to improve some things, and therefore it is still a subject of research and further development.
L360: You explain in the paragraph below why you do this, and I really like your rationale. However, I thought I should point out that URLs are not persistent, which defeats your rule R2 from the start. But perhaps this is not a show-stopper.
Re: Thank you for your thoughtful comment. We agree that URLs can normally change easily, so your observation is correct.
But we use https://w3id.org/fst/resource/UUID URIs, which are resolvable and redirected as mentioned in the paper; they are also URLs, because each term points to a location (via redirects). We added a detailed explanation of why we think that URLs can also be persistent. If this is a naming issue, we would like to keep the term 'URL', as the w3id.org website uses it.
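To make the scheme concrete, minting such a resolvable identifier is a one-liner. The authors' published identifiers look like UUIDv7; the sketch below uses `uuid4` since that is what the Python standard library provides:

```python
import uuid

def mint_pid(base: str = "https://w3id.org/fst/resource/") -> str:
    """Mint a new PID URL by appending a freshly generated UUID."""
    return base + str(uuid.uuid4())

pid = mint_pid()
print(pid)  # e.g. https://w3id.org/fst/resource/3f2b09c4-...

# The UUID part can always be recovered and validated again:
assert uuid.UUID(pid.rsplit("/", 1)[1])
```

The w3id.org service then only has to redirect the fixed prefix; the UUID suffix never changes, which is what makes the URL act as a persistent identifier.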
L460-461: So this "sensor_data" must also have a PID, otherwise the linkages will not be permanent either.
Re: Thank you very much for paying so much attention to detail and pointing that out. You are correct! There is a mistake in figure 11, which we corrected in the new version.
L478: Love it!!!
Re: Thank you for the comment! We also think that this is an important part of the usage and have therefore slightly expanded the explanation.

Comment #187 Gretchen Greene @ 2024-11-07 07:06

As the responsible topical editor, I would like to thank the reviewers for their constructive feedback.
After consideration of the comments, I advise the authors to revise the publication considering the reviews and their suggestions. Please submit a revised version of the publication within 4 weeks for further consideration.

Invited Review Comment #185 Anonymous @ 2024-11-03 11:00

 

Summary:

The authors use a concrete experimental system as an example to show how data obtained from experiments can be published in a FAIR manner.

FAIR stands for Findable, Accessible, Interoperable, Reusable, and are guidelines that scientists/engineers are urged to follow when they publish their research data to the community [1].

As the authors note, these guidelines do not contain a prescription for *how* to implement FAIRness, and they propose a number of requirements on the (meta)data gathering and publication process that can help achieve this. They propose and implement a data collection framework and apply it to an experimental test environment where data are collected on physical objects using a variety of sensors.

The authors devise information models for the (meta)data describing the different elements in the experimental setup. They provide an implementation where instances of these models are stored on a publicly available GitLab repository, and they show some examples how to use these models.

The work is rigorous and seems quite comprehensive; the models are detailed, and most of the relevant artefacts are freely accessible.

I do have some general comments and requests for clarification.

Regarding FAIRness.

The FAIR specification offers a number of guiding principles ([1], Box 2). The authors are not explicit how these are implemented in the current work. Instead, they present their own list of requirements, some of which are definitely close to the FAIR principles, some of which seem extensions to the specific domain (of which btw I am not an expert), some seem lacking in precise motivation. But it is up to the reader to make these links. It would improve the understanding and strengthen their claim that they are applying the FAIR principles if they made the links between FAIR guidelines and their requirements explicit. This could include some explicit statements in case some of the FAIR principles may not have been implemented exactly or completely.

Regarding the list of requirements:
R1,2,3, 4 are integral to FAIR, but in the current text have not received explicit justification. It would be useful to give that here as well.
R7: seems a bit obvious and maybe empty of meaning; models are supposed to contain relevant quantities.
R10: this statement is unclear.
R14: does this mean any experimental setup should be describable with the provided model?
R16: do the authors mean that the model itself should allow for this expansion, or that one should be able to change the model itself easily (say, add new classes, attributes, relations)?
R17: how is version control to be combined with persistent identifiers? Will there be a different PID for different versions of the same experimental results?
R19: why is this a requirement? It seems an implementation detail.

 

Information models, standardization, interoperability

Section 3 presents an overview of the State of the Art for making data FAIR. It contains an overview of some of the relevant technologies such as RDF and a background on ontologies and information models. The authors state that the difference between the latter two is not that well defined, and go on to produce three information models, for Components, Substances and Sensors, as RDF graphs. With all the attention given to ontologies in this section it would be good if the authors could explain why they chose RDF instead of OWL.

The authors state in line 233: “…information models …, which are based on existing ontologies.”  It is unclear whether the authors mean to imply that the information models “just” use existing (RDF) vocabularies for annotating certain elements, say via rdf:type, or that they intend to use the class structure of existing ontologies. To that end it might be useful if the authors could provide the RDF graphs for representing the models also in a machine readable format (ttl, json, xml).
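For illustration, a minimal machine-readable serialization in Turtle might look as follows. The substance IRI is the one referred to elsewhere in this review; the class IRI and the choice of properties are my own assumptions, not taken from the paper:

```turtle
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix fst:     <https://w3id.org/fst/resource/> .

# Hypothetical instance of the Substances model; the class IRI is invented
# for illustration and would in practice come from an established ontology.
fst:018dba9b-f067-7d3e-8a4d-d60cebd70a8a
    rdf:type           <https://example.org/fst-ontology/Substance> ;
    dcterms:identifier "018dba9b-f067-7d3e-8a4d-d60cebd70a8a" .
```

Such a file would make it unambiguous whether existing vocabularies are used only for annotation or as a class hierarchy.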

Currently the models are only available as diagrams in the paper. These diagrams are not uniform. The diagram for Components and one of the two diagrams for Sensors have some subdivisions which are missing from the model for Substances. The latter seems more comprehensive and similar to the model for Sensors in the appendix. Maybe all three models only need to be presented in that way. Especially since the assignment to these subdivisions seems somewhat inconsistent between the Components and Sensors models. E.g. UUID is assigned to METADATA in one, PROPERTIES in the other.

Various instances of two of the models (Sensors and Substances) are available as RDF files in a GitLab repository, which is very useful. It would be nice if examples of the third model, for Components, were available as well.

The authors make use of standardized vocabularies but add details to describe the specific environment they aim to model. It is generally understood that true interoperability between different data sets needs them to be described using standardized, “global models” (or schemas), at least within a common scientific domain. This should allow users to write code that can deal with the format and contents of each data set, facilitating comparison, joining etc. One could argue that FAIR guideline R1.3 “(meta)data meet domain-relevant community standards” hints at this. Do the authors think that their models are sufficiently standardized to enable this type of interoperability?
Does there exist an effort to provide such standardized models for the domain relevant to the current paper?

Reference data

For the description of the Substances the authors state that some of the model elements might lead to very large files if serialized as RDF graphs in some of the standard formats (ttl, json, xml). Instead, they propose to use HDF5 files for storing this data and refer to this file from within the main metadata files.

Is there an accepted standard for linking to (elements inside) HDF5 files, or is this ad hoc? Should the HDF5 file correspond to some standard format as well?
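I am not aware of a single accepted standard for this. One ad hoc pattern would be to append a fragment identifier naming the dataset path inside the HDF5 file; the property IRI and the fragment convention below are purely my own illustration:

```turtle
@prefix ex: <https://example.org/fst-ontology/> .

# Hypothetical link from a substance description to a dataset
# inside an accompanying HDF5 file (property and file names invented).
<https://w3id.org/fst/resource/018dba9b-f067-7d3e-8a4d-d60cebd70a8a>
    ex:hasReferenceData
        <https://w3id.org/fst/resource/018dba9b-f067-7d3e-8a4d-d60cebd70a8a.h5#/vapor_pressure> .
```

If the authors adopt such a pattern, it would help to state explicitly how the fragment is to be interpreted by consuming code.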

And for the example the authors use and refer to in https://w3id.org/fst/resource/018dba9b-f067-7d3e-8a4d-d60cebd70a8a.ttl, it seems that the authors use data extracted from an external source. Would it not be better to give a reference to that source (NIST Chemistry WebBook) instead of using its contents in a separate copy, formatted in a bespoke manner?

Implementation

The authors have implemented the framework using GitLab as the data store. They have added a separate mechanism for providing persistent identifiers using two levels of redirection. This seems an interesting and valid approach. One thing that seems to be missing is FAIR guideline F4: “(meta)data are registered or indexed in a searchable resource”. Of course, the interpretation of that guideline might be up for discussion, but are the authors planning to provide a semantic query interface into the contents of the metadata objects that could allow users to search for data sets of interest?
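For example, once the metadata graphs were aggregated into a triple store with a SPARQL endpoint, a query such as the following could retrieve all sensor descriptions. The class IRI is illustrative only; the paper does not name one:

```sparql
PREFIX rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dcterms: <http://purl.org/dc/terms/>

# Find every resource typed as a sensor, together with its identifier.
SELECT ?sensor ?id WHERE {
  ?sensor rdf:type <https://example.org/fst-ontology/Sensor> ;
          dcterms:identifier ?id .
}
```

Even a periodically regenerated static index of all instance files would go some way toward satisfying F4.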

The authors provide links to code for various parts of the implementation, but sadly these did not resolve to accessible web locations. I identify some of these locations below. I think this code may be useful for interpreting the description of the CI/CD pipeline, which may be somewhat obscure without it and requires quite some understanding of GitLab Pages and CI/CD support, (HTML) redirects, etc.

Applications

I like application 1. It gives a very nice example of the use of an IRI as a QR code on a physical object (a sensor).

I find application 2 somewhat confusing. In particular the introduction of a completely new data structure for a measurement in MATLAB. Indeed I agree with the authors that it would be good if graphs of the experiment could be provided. Would that necessitate a new model (for Measurements)?
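If such a model were introduced, I would imagine, purely as a sketch in which every name is my own invention, something along these lines:

```turtle
@prefix ex:      <https://example.org/fst-ontology/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix xsd:     <http://www.w3.org/2001/XMLSchema#> .

# Hypothetical Measurement instance linking a sensor description
# to the raw data file it produced (all IRIs and properties invented).
ex:measurement-001
    rdf:type        ex:Measurement ;
    ex:measuredBy   <https://example.org/fst-ontology/sensor-xyz> ;
    dcterms:created "2024-06-03T00:00:00Z"^^xsd:dateTime ;
    ex:dataFile     <https://example.org/data/measurement-001.h5> .
```

A model of this kind would tie the MATLAB data structure back into the existing RDF descriptions rather than standing apart from them.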

Application 3 is rather generic and can be left out.

Conclusion

I like the work and I think with some minor revisions it can be accepted.
I hope that other researchers will be inspired to perform a similarly detailed publication of their research data. And it would be of great interest if others would engage with the framework presented here, for example by trying to use the same models for describing their (meta)data. This could lead to feedback on the models and possibly their evolution towards a domain-specific standard.

 

Minor comments and questions ref specific locations in the paper:

L17,18: This sentence is a bit wonky and I do not quite understand what is intended. Maybe rewrite.

Figure 1: this figure is very unclear and could be any graph, as the text is hardly readable. I have been able to read some of the RDF graphs presented in the paper into Python and visualize them using an interactive graph viewer, but it may be hard to improve the layout within a static page. As the authors also link to a visualizer (https://issemantic.net/rdf-visualizer) they could consider leaving it out.
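For what it is worth, the kind of preprocessing I used can be approximated with the standard library alone. This is not my actual script (which used an RDF library); it is a minimal sketch that parses simple N-Triples lines with IRI subjects and objects into an adjacency structure a graph viewer could consume, and the example triples are invented:

```python
import re
from collections import defaultdict

# Matches one N-Triples line of the form:  <s> <p> <o> .
# (Literals and blank nodes are deliberately out of scope for this sketch.)
TRIPLE = re.compile(r'<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.')

def to_edges(ntriples: str) -> dict:
    """Return {subject: [(predicate, object), ...]} for IRI-only triples."""
    edges = defaultdict(list)
    for line in ntriples.splitlines():
        m = TRIPLE.match(line.strip())
        if m:
            s, p, o = m.groups()
            edges[s].append((p, o))
    return dict(edges)

# Invented example data in the spirit of the paper's instance files.
data = """
<https://w3id.org/fst/resource/abc> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <https://example.org/Sensor> .
<https://w3id.org/fst/resource/abc> <http://purl.org/dc/terms/isPartOf> <https://w3id.org/fst/resource/def> .
"""

print(to_edges(data))
```

The resulting dictionary can be handed to any network-visualization tool that accepts an edge list.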

L49,50: This sentence is not well formed. Likely remove one of the two occurrences of “are investigated”.

L168 may want to link to RDF syntax specification:
https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/

L173 IRIs are identifiers of, not necessarily links to information.

L197 “marker” should be “maker”

L249: “When linking, only known semantic vocabulary from established ontologies is used.” should probably be “When linking, only known semantic vocabularies from established ontologies are used.”

L250:” … ontologies are summarised in the following table with its reference and its application…” should probably be “… ontologies are summarised in the following table with their reference and their application…”

Table 2: instead of ‘prefix’, maybe use ‘URI’ in the header of the second column? If these are supposed to be hyperlinks, the entries for dcTerms, dcType and foaf did not resolve to a valid URL.

L300: is the boldface and capitalization of Origin intentional? A similar phrase is used in line 273 without boldface. Here it seems to be given a special meaning which is not explained.

L324/5: the link does not resolve to an existing page at the time of this review.

L372: the link in footnote 7 does not resolve to an existing page at the time of this review.

L380: “…necessitates updates…” should probably be “… would necessitate updates …”.

L425: typo “beeing”

L430: the link in footnote 9 does not resolve to an existing page at the time of this review.

L504: the links in the URL column of Table 3 do not resolve to an existing page at the time of this review.

 

[1] M. D. Wilkinson, M. Dumontier, I. J. J. Aalbersberg, et al., “The FAIR Guiding Principles for scientific data management and stewardship,” Scientific data, vol. 3, p. 160 018, 2016. DOI: 10.1038/sdata.2016.18. 

 

Invited Review Comment #181 Anonymous @ 2024-10-25 14:27

Review of “How to Make Bespoke Experiments FAIR”

 

The premise of the article is that many scientific workflows are bespoke. The authors listed several use cases. In the case of their tensile test rig, they wished to capture information from three buckets: (1) sensors, which include instruments and the various detectors within, (2) components, which are auxiliary physical objects that add value to the experiment, and (3) substances, which could include the specimen itself and also the surrounding media relevant to the experiment. They proposed schemas for organizing the three information buckets, talked a little bit about RDF and ontologies, and then finally unveiled their information model based on RDF concepts.

 

This is a very charming manuscript. The authors cover much ground, from tracking ambient experimental conditions, to samples, to the propagation of uncertainties. This is ambitious work and it is very much worthy of publication. However, as with any ambitious work, one might expect many criticisms. The questions and comments I have raised are located on the PDF itself. Because they are extensive, it will require significant effort for me to transcribe these into plain text. I have included a link here where you may retrieve the PDF with comments. While I don’t mind sharing all my comments with the authors, the link here is not anonymous, so I would like the editor to please download the article and then share it through your own mechanism.

 

In any case, a revised manuscript, assuming all of my concerns are addressed, will be a highly valuable contribution to the engineering community.

Metadata

  • Published: 2024-06-03
  • Last Updated: 2024-06-02
  • License: Creative Commons Attribution 4.0
  • Subjects: Data Infrastructure, Data Sets
  • Keywords: FAIR, linked data, modular test environment, information model, experimental data, information infrastructure