Collaborative creation and management of rich FAIR metadata: Two case studies from robotics field research

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Christian Backe, Veit Briken, Atefeh Gooran Orimi, Rayen Hamlaoui, Malte Wirkus, Bilal Wehbe, Frank Kirchner

Abstract

This paper presents lessons learned from the creation and management of FAIR (Findable, Accessible, Interoperable, Reusable) data and metadata in two recent robotics projects, in order to derive principles and building blocks for collaborative (meta)data management in field research. First, an inventory of metadata purposes and topics is presented, distinguishing between executive metadata necessary for data producers, and rich reusable metadata satisfying the FAIR principles. A model of the metadata creation process is developed and compared with the Metadata4Ing ontology. Second, social aspects of FAIR research data management (RDM) are discussed in the project context and beyond. The primary tasks of a FAIR research data manager are analyzed in three domains: data production team, research domain, and FAIR RDM community. Third, some improvements on prominent data lifecycle models are proposed to support the requirements of collaborative RDM, and to foster an iterative improvement of RDM systems.

Comments

Comment #126 Robert H. Schmitt @ 2024-07-30 10:37

I thank the reviewers for their conscientious and detailed reviews.
In accordance with their recommendations, I advise the authors to revise the manuscript incorporating the changes suggested by the reviewers.
The authors are expected to submit a detailed response to the reviewers, in which they make clear how they considered the reviewers' comments.

Invited Review Comment #125 Kilian Geiger @ 2024-07-26 13:54

In the presented work, the authors discuss challenges they faced when trying to implement modern research data management (RDM), based on their experience in two research projects with a strong focus on in-field experimental campaigns. After a general introduction to the mentioned research projects, they split their discussion into the three dimensions “content”, “social” and “time”, each of which is in turn divided into multiple subsections. For each dimension, they state the challenges they faced in the practical application of RDM and the models and approaches they developed to address these.

The presented work gives a very interesting insight into the practical application of RDM, which is often lacking in the literature. As stated by the authors, many approaches and models exist in theory, but without any practical validation those are of low value to the community. By now, consensus exists in the research community that RDM is important and will be even more so in the future. Most people don’t have issues with the concept itself but lack the necessary knowledge or tools to properly adhere to the requirements defined by the FAIR Guiding Principles. The authors are able to show the pitfalls and challenges that arise when applying these ideas in practice. While doing so, they criticize that some of the existing models might be too general, which might be true to some degree. They also claim that some of the models can be too restrictive, for example in their wording, but by creating meta-models around these generalized models, they create even more restrictions and rules. For the most part, the presented models and approaches seem to be beneficial for the authors and allow them to do some impressive work in the field of RDM. At the same time, I fear that some of the presented models could create additional overhead, which can potentially put off researchers who are interested in RDM but unsure where or how to get started. By this, I don’t want to convey that the presented models are not valid; on the contrary, they are very interesting and a great contribution to the field. I am just wondering how approaches like these could be used to lower the entry hurdle into RDM. It sounds like the authors have developed some tools that assist them in applying their theory in practice and thereby present RDM as beneficial instead of another annoyance for the researchers who produce the data. It is unfortunate that those tools are neither presented nor shared by the authors.
I hope that future publications are planned to do so and am looking forward to those. Additionally, I find the presentation of other research in the field of RDM a bit lacking. While the authors reference plenty of other work, its contents and the resulting gap that this publication tries to address do not become clear.

The overall presentation is adequate; however, figures 1 and 4 are of low resolution, and the font size in most images could be a bit larger for easier readability. Readability could also be increased by condensing lists of references, such as “[11-15]” instead of “[11] [12] [13] [14] [15]”. The text is, in general, well written and easy to follow while not relying on colloquialisms too often. Minor issues are addressed in the following list of detailed feedback. The research data mentioned is partly shared via the publicly available Zenodo repository. From what I can tell, the provided data is well documented and adheres to the FAIR Guiding Principles. Authors, Affiliations, Contributions and References are provided in a proper manner. No software is provided, which is understandable, as it is not the focus of the presented work.

 

Line-by-line feedback:

Please note that some of these are more general feedback and thought-provoking questions that do not necessarily need to be addressed as part of this publication. I put a ** before each for easier revision of this publication.

** Lines 3-6: The general FAIR principles should be adhered to either way. Some broken down requirements for practical implementation could be beneficial though.

Line 17: Which three parts? It would make sense to name them here.

Line 22: Personification of “section”; passive voice would be preferable here, i.e., “In this section, base elements […] are developed.” That said, “presented” or “described” would fit better, as the development of the model has already taken place.

Lines 86-87: Where is this mathematical model presented? Do you just yearn for a way to mathematically describe the richness? It sounds like you will present one in Subsection 3.1, which is not the case.

Lines 103-104: Some more detail on the existing models would help the reader understand why you decided to create your own model. What do they encompass? What distinguishes them?

** Lines 120-121: Is this “routinely” the case? I feel like this is an area that is still lacking in many cases and tools that facilitate proper RDM would be of great benefit here.

** Lines 124-125: Shouldn't these be the same? Why not have one set of metadata that satisfies both groups? Otherwise, interoperability would not be satisfied.

** Lines 135-138: Usually, data producers are data users as well and should have similar requirements towards the reusability of data. I understand that it is often the case, as you stated, that data producers take some implicit knowledge for granted and try to keep the additional effort and overhead to a minimum. From personal experience, I can tell, that this approach often leads to more headaches, including unnecessary iterations and repeated experiments, than one would like to admit.

Line 155: Why is the input not part of the model?

Line 158: How is the passage of time represented in the model? In the image, top to bottom and left to right are both time-dependent.

Table 8: How can Meta-Metadata, Meta-Meta-Metadata and so on be distinguished? Do they need to be distinguishable? Are they not the same from a model standpoint?

Lines 180-182: This makes me think that the comparison is not really valid. Both models try to achieve different things. The described model is a meta-model, which could be applied to anything and does not really relate to the specific process class of the M4I model.

** Lines 212-215: How can this be addressed?

** Lines 218-220: Why do they need to be in conflict?

** Lines 224-227: Isn't this the problem that should be addressed and that was mentioned in the previous subsection?

Table 9: As you mentioned in the text, limited time frames are a very important factor for the data producers.

Line 259: Missing comma after “of course”

Lines 249-251: Isn't it possible that the "last" (internal) order always points to an external definition, which is properly defined?

Lines 253-254: Isn't this exactly what standards and open vocabularies are for?

** Lines 272-277: Are they really that different? Wouldn't it be more beneficial if they were identical?

Lines 278-285: Doesn't the circular structure imply this?

** Lines 290-293: Is this still part of the data management?

** Lines 293-295: Is naming really so important that it warrants another standard/model? I understand that your model still exceeds the goals and boundaries of the NFDI model, but I feel like different naming conventions just increase the potential confusion and hesitation people have towards RDM.

** Lines 310-311: This also needs to be known and properly understood, which can take a lot of effort.

Lines 312-315: What are some examples? What steps were taken during the two projects?

Line 318: “, e.g.,”

Line 319: “units,”

Line 320: Who is responsible for this?

Line 324: “so it is easy to miss important information, e.g., critical failure”

** Line 336: Shouldn't all of this happen before creating the data?

Lines 353-354: Wouldn't this be part of the Evaluation phase?

** Lines 360-368: Wouldn't this be a sign of insufficient planning?

Lines 377-379: How is this different to the planning phase of creation and provision?

Line 380: Quality assurance of what? What does this entail?

** Lines 395-398: How is this decided and defined? Could this be declared in the (meta‑)metadata?

Line 427: “itself”

** Lines 439-441: Is the evaluation useful before the other "outer" steps have taken place?

 

Overall, I would recommend this work for publication after minor revision.

Invited Review Comment #124 Anonymous @ 2024-07-23 12:45

Summary: 

The article “Collaborative creation and management of rich FAIR metadata: Two case studies from robotics field research” presents lessons learned from implementing research data management in collaborative research projects in robotics field research. It outlines the metadata creation process and the requirements of producers and re-users of research data, and presents a metadata creation process based on the Metadata4Ing ontology. The responsibilities of a research data manager towards the different stakeholders (data production team, re-users, and RDM community) are discussed in this context. Finally, a self-improving data lifecycle workflow is derived from the fact that data production is usually carried out in several cycles, providing an opportunity to improve RDM within a project. Using this data lifecycle, lessons learned from two projects are presented.

 

Novelty and Originality

The novelty and originality of the paper are rated as moderate or low. The introduction is kept rather short and generic, outlining the general benefits of real-world examples of RDM for practitioners. The authors do not motivate the need for openly available robotics datasets, although the availability of robotics data is currently a hot topic in AI research with the advent of robotics foundation models. It is difficult for the reviewer to assess novelty, as no statement is made as to whether and to what extent the presented results go beyond the current state of the art, and no similar or related work presenting real-world examples is mentioned in the text.

The reviewer suggests revising the introduction and concentrating on outlining the novelty of the work.

Major comments:

·         It is not clearly articulated why the robotics examples were used and how the robotics community benefits from the results presented in this paper. The findings and the way they are presented seem to be applicable to any project that generates research data in multiple steps and iterations. The reviewer suggests clarifying whether the work should mainly

·         From the way the findings are presented, it is not clear to the reviewer whether and how the models and concepts presented in Sections 3 and 4 were applied already during the two projects or whether the models and findings were extracted from the experience in the two projects.

·         A major language revision should be undertaken. There is extensive use of inserted subordinate clauses, which is unusual for the English language. In particular, there is an exaggerated use of commas, mostly in places where no comma is required in English. 

·         The objective of the RoBivaL project is missing in both Section 2.1 and Table 5 (what was the purpose of the evaluation and how was the performance of the different robot types evaluated?).

·         L. 103-104: The sentence is incomplete. 

·         It would be a service to the reader if paragraphs ll. 109-118 were supported by a figure or table summarizing the FAIR principles with their attributes. 

·         Ll. 131-141: It is unclear to the reviewer how these motives were derived. What methods were used to arrive at this conclusion?

·         Ll. 207-211 are too narrative and exaggerated for the usual neutral tone of scientific writing. The message of this paragraph is already conveyed by the previous paragraph. Therefore, the reviewer suggests that this paragraph be omitted.

·         Section 5.2 is narrative and the reader does not learn much about the effectiveness and success of the RDM implemented in the two projects. The reviewer suggests revising Section 5 to make clear the connection to Sections 3 and 4 and to outline where the RDM strategies followed were successful and adequate and in which phases and situations they failed.

Minor Comments:

 

·         There is quite extensive use of italics (e.g. l.20 "reusable" or l.29 "and iterative"), but it is not obvious to the reader why certain words are emphasised in this way. The reviewer suggests that this form of emphasis be kept to a necessary minimum.

·         L.27 and L.191: "where" refers to "a FAIR manager", so the singular form should be continued in the sub-clause.

·         Tables 1 and 3 are actually figures.

·         Ll. 70-73: The sentence has a very complicated structure and should be split into two parts.

·         In the case of direct quotations (e.g. ll. 100 – 102), the references should be given immediately after the quotation.

·         L. 98: distinguishes (n missing)

·         L. 186: multiple (l missing)

·         L. 209: demonstration (n missing)

·         L. 247: Why is there a colon before the new paragraph?

Metadata
  • Published: 2024-06-10
  • Last Updated: 2024-05-27
  • License: Creative Commons Attribution 4.0
  • Subjects: Data Infrastructure
  • Keywords: Research Data Management, RDM, Metadata, Field Data, FAIR, Robotics