From Ontology to Metadata: A Crawler for Script-based Workflows

This is a Preprint and has not been peer reviewed. A published version of this Preprint is available on ing.grid. This is version 2 of this Preprint.

Authors

Giuseppe Chiapparino, Benjamin Farnbacher, Nils Hoppe, Radoslav Ralev, Vasiliki Sdralia, Christian Stemmer

Abstract

The present work introduces HOMER (High Performance Measurement and Computing tool for Ontology-based Metadata Extraction and Re-use), a Python-based metadata crawler that automatically retrieves relevant research metadata from script-based workflows on HPC systems. The tool offers a flexible approach to metadata collection, as the metadata scheme can be read from an ontology file. With minimal user input, the crawler can be adapted to the user's needs and easily implemented within the workflow. The obtained information can be further post-processed automatically: for example, strings may be trimmed by regular expressions or numerical values may be averaged. Currently, data can be collected from text files and HDF5 files, as well as hardcoded directly by the user. However, the tool has been designed in a modular way that allows straightforward extension of the supported file types, the instruction-processing routines, and the post-processing operations.
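As a purely illustrative sketch of the two post-processing operations mentioned above (the function names, the sample solver banner, and the numerical values are invented for the example and are not HOMER's actual API):

```python
import re
from statistics import mean

# Illustrative sketch only -- not HOMER's actual API.
def trim_with_regex(raw: str, pattern: str) -> str:
    """Keep only the part of a raw metadata string that matches the pattern."""
    match = re.search(pattern, raw)
    return match.group(0) if match else raw

def average_values(values: list[float]) -> float:
    """Collapse repeated numerical metadata values into their mean."""
    return mean(values)

print(trim_with_regex("Solver v2.4.1 (build 2023)", r"v\d+\.\d+\.\d+"))  # v2.4.1
print(average_values([1.2e-6, 0.9e-6, 1.1e-6]))                          # ~1.07e-06
```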

Comments

Comment #115 Kevin Logan @ 2024-06-10 15:37

I am happy to inform the authors that, on consideration of the new review comments, the submission is accepted for publication as a peer reviewed article in the journal ing.grid.

I would like to thank both reviewers once again for their detailed and thorough reviews of the submission. Further thanks go to the authors for reworking their submission to include the improvements recommended by the reviewers.

Invited Review Comment #114 Anonymous @ 2024-06-06 11:32

The authors have improved the paper and answered my comments properly. I particularly appreciate the improved structure and added explanations making the subject more accessible to the reader. 

The only aspect I stumbled upon again is the usage of data/metadata in plural form. The authors have explained their view on the matter - even though it appears to me as an uncommon use of the language, the aspect is minor and I think is acceptable after all.

I commend the authors for this interesting paper and tool. I recommend publication.

Invited Review Comment #111 Anonymous @ 2024-06-03 12:37

The authors of the paper "From Ontology to Metadata: A Crawler for Script-based Workflows; HOMER: an HPMC tool for Ontology-based Metadata Extraction and Re-use" present an approach to extract and use metadata within the workflows of High Performance Computing systems. The work presented in the paper is relevant in the current context of data explosion, the complexity of managing this huge amount of data, and the resulting loss of track of data and hence their re-generation from scratch. The authors provide good arguments on the necessity of a tool capable of understanding the semantics behind the data and their metadata. However, the paper is not an easy and smooth read; there are jumps that make the reading difficult in places.

The paper is presented more as a technical paper than a research one. It would have added value if the authors had presented the research significance of the application of semantics through tools such as HOMER to data generated through HPMC systems, rather than explaining Python script models in detail.

The authors have presented the research significance, in line with the FAIR principles, through this solution. In my view, this is very reasonable. The paper has also been restructured to present research significance rather than technical accomplishment, though I am not convinced by Section 3 (Code Description). If the section had been written without putting so much load on the technical details of the Python scripts, it would have been better.

The paper presents work involving multiple disciplines. Therefore, the authors should take extra care when introducing technical terminology and should explain technical terms as simply as possible before their usage in the paper. For example, the term "ontology", which appears abruptly in line 19, might be new to experts in the field of fluid mechanics, while the abbreviation CFD might be new to semantics experts. The authors should balance their usage not just by defining such terms but also by explaining how they appear in the overall picture of the research work. Moreover, in places the explanations provided are misleading: ontologies are generally not developed just to define controlled vocabularies; they are developed to provide a semantic background to data. Controlled vocabularies present the semantics behind the vocabulary through knowledge organization systems (KOS) such as the Simple Knowledge Organization System (SKOS). Such KOS can be presented through ontologies.

The re-phrasing, or rather the elaboration of ontologies not only as a tool to present controlled vocabularies but also as semantic relations for machine interpretation, is acceptable though not entirely accurate. Therefore, I would accept the re-phrasing.

Abbreviations should be expanded before their further usage in the paper: CFD (Computational Fluid Dynamics) workflows at line 30 should have been expanded first but are only expanded later, on lines 166/167. The same goes for RDM (Research Data Management, line 40). Likewise, Metadata4Ing (Metadata for Ingenieur [Engineers], line 23) should have been explained in some detail.

This has been taken care of.

The authors should use phrases to describe technical terms as far as possible. For example, the sentence "The code has been developed as a collection of routines, each performing a different action, rather than as a single script." within section 2.1, lines 113/114, could have been explained through the phrase "modular routines". The phrase, however, is only used in line 117 of the section.

This has also been taken care of.

Included here are some other inquiries and suggestions (only agreement or disagreement with the modified text is mentioned below; if nothing is mentioned, the text modification answers the comment):

  • The three keywords "path", "type", and "pattern" cannot be seen in the actual HOMER implementation section. Therefore, are these keywords specifically meant for the Pizza example?
    • The Pizza example has been changed.
  • Do the sentences between lines 160 and 162 ("This involves……simulation code is run again.") mean there is a learning mechanism that trains the parameters for the configuration file for later?
    • I could not find the answer.
  • In line 170 it should be "aeroplane's", not "airplane's".
    • "Airplane" still exists, but I will accept it.
  • Is there any specific reason behind choosing Metadata4Ing? Why is the broadly used provenance ontology PROV-O not used as an example? Most users might be able to relate to PROV-O, as it is standardized by the W3C, and not to Metadata4Ing. The M4I ontology is a rather new one.
    • The authors have tried to relate PROV-O to M4I, but the argument behind using M4I is still not very satisfactory.
  • What is the impact of the choice of another ontology (e.g., PROV-O) on the five steps of the HOMER process?
  • Is the "Pizza Ontology" the same ontology provided by Stanford University in the Protégé tutorial?
  • Is "Vegetarian Pizza" the same as "White Pizza"? It is a bit confusing here because vegetarian pizza is never mentioned as one consumed by the three users.
    • The example has been changed to CFD.
  • In lines 279/280, it is mentioned that the (meta)data are extracted from plain text files. Will this be possible if the data are available online?
    • The marking done by the authors indicates "yes".
  • The text in lines 325 to 326 is confusing. What do the authors mean by intended use: is it meant for different days, different numbers of pizzas, or different numbers of users?
  • What are menus 1-6 in line 327?
  • Section 3.2 "Simple CFD-like application" should provide actual insight into the workflow, but the section is too short and not adequately explained. The authors should prioritize this section as the main engineering application of the HOMER tool.
    • This is the case now.
  • Is it necessary for the users to have an ontology file? What will happen if the user does not have one?
  • What are the further user inputs required for centralized mode?
  • It would help if the figure is on the same page as it is mentioned.
  • Is it true that the users have to develop an ontology based on a standardized ontology like M4I?

Comment #108 Giuseppe Chiapparino @ 2024-04-19 19:32

---- Answer to Editor’s comment
We, the authors, would like to thank the editors of ing.grid for the consideration of our manuscript and for the opportunity to submit it to the journal in the first place. We would also like to thank the editors for the possibility to upload now a revised version based on the reviewers’ comments addressed below.
Moreover, our gratitude goes to the reviewers, who have provided us with many valuable and constructive comments. Their feedback has been addressed below as thoroughly as possible from our side. We hope to have answered all the concerns brought up by the reviewers.
In the revised text, the changes in response to the comments and issues raised by Reviewer 1 are marked in red, while those addressing the comments of Reviewer 2 are marked in blue. In case of shared comments (namely in the new section 4), the text is highlighted in red, but the section or subsection title is marked in blue.

General revision: As both reviewers have brought up this issue, we have reprioritized the example used to show the capabilities of HOMER, the metadata tool presented in this work. To showcase the aim and the application of the tool more clearly, the main example now focuses on a simple CFD case, while the "Pizza-ontology" example has been almost entirely removed. Nevertheless, the reader can find all the references to the "Pizza example" in the introduction to Section 4, as this example is still the main showcase present in the GitLab repository of the code. There, the reader/user can find all the files, as well as a detailed step-by-step guide, as indicated in Section 4.


---- Answer to Reviewer 1
Major comments from reviewer 1:

- Strengthen objective: In the introduction, the authors mention many other relevant metadata crawlers or extraction tools. Together with a brief description of the respective tool's capabilities, the authors also mention limitations or shortcomings. When introducing their own tool, HOMER, the authors mention what the tool does. Moreover, in my view, the authors should strengthen their objective and specifically highlight what differentiates HOMER from the other previously mentioned tools (e.g. iCurate, signac, Xtract, ExtractIng, etc.). In other words: is there a difference in scope, why is this tool required, or why did they not use one of the other existing tools?
Answer - The main features of HOMER have been summarized in section 1 in order to highlight the differences with respect to the other tools, whose shortcomings have been mentioned in the literature review. The revised paragraph now reads as follows: The tool is designed to be flexible and adjustable to the user's needs in its application and easy to implement in potentially any HPMC workflow. This development approach tries to overcome all the shortcomings highlighted for the RDM solutions and tools reviewed in this section, and to allow HOMER to be suitable for a wide range of applications. In fact, the crawler can retrieve metadata from text and binary (HDF5) files, as well as from the user's annotations and terminal commands, at any stage of the workflow and without interfering with the other processes composing the workflow. The automated extraction of metadata can be performed both in edge and in central mode, making the tool suitable for extracting information also from central repositories (such as data lakes). The metadata extraction is based on the ontology schemes chosen by the users. However, the users do not need to strictly adhere to a fixed scheme, but can adjust and customize it according to their needs. Moreover, although developed primarily with the engineering sciences as the main application, HOMER can be employed to retrieve metadata from HPMC workflows in a wide variety of research fields. Finally, the tool has been written with a modular structure, so it can easily be developed further to include new features. Hence, HOMER is proposed as a flexible and consistent RDM tool that can be used in a wide variety of applications and fields with limited user input, in order to easily promote the FAIR principles and enrich the data created by the user.
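As a concrete illustration of the HDF5 side of this extraction, here is a minimal sketch using the common h5py library; the file name, group layout, and attribute names are assumptions made for the example, and HOMER's own reader may be organized differently.

```python
import h5py  # common Python library for HDF5; used here for illustration only

# Hypothetical file layout -- group and attribute names are invented.
with h5py.File("simulation_output.h5", "r") as f:
    metadata = {
        "solver": f.attrs.get("solver_name"),          # file-level attribute
        "mach":   f["flow"].attrs.get("mach_number"),  # group-level attribute
        "n_iter": f["flow/residuals"].shape[0],        # derived from a dataset
    }
print(metadata)
```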

- Improve paper structure: In my view, the authors are describing a solution, i.e. the tool HOMER, before characterizing the problem in detail. This structure makes this paper difficult to read at times. It would be very beneficial for readers to (1) first illustrate a sample workflow without HOMER, (2) then highlight the critical points from an RDM perspective, and (3) finally illustrate the solution/improvement with HOMER. At the moment, aspects and reasoning about the tool remain somewhat abstract and the authors need to point to aspects discussed in downstream sections (not ideal) before it starts to make sense after Sec. 3.2 (too late!). This would also allow the authors to address how the user's workflow changes, i.e. which overhead is caused by using the tool, in comparison to the original (sub-optimal) workflow. I think the paper has all of the components it needs, but I strongly suggest reordering them to improve readability.
Answer - The paper has been partially restructured in order to improve readability and clarity. A new, short section 2 ("Characterization of the problem") has been added to better illustrate the typical situation where the use of HOMER would be beneficial. Figure 2 has now been moved to Section 3 ("Code description", see also the last minor remark). In Section 4 ("Example application"), the CFD case is now taken as the main showcase for the application of HOMER (see also the next point).

- Choice of the example(s): In Sec. 3 the authors first introduce the example of the simulation of an airplane wing, which is briefly discussed, but then dropped (why?) in favor of a "pizza example", before coming back to it in Sec. 3.2. Given the scope of the journal for the engineering sciences, I would have strongly preferred the illustration of the tool's capabilities with the wing simulation only (because: relevance). Interpreting persons as "processing steps" (line 197) and menu properties as "computational variables" appears overly complicated, if the alternative example can directly be the wing simulation process with parameters engineers readily relate to. I think the pizza example is redundant and can be omitted in favor of a more detailed CFD example. This is just an opinion, but I wanted to mention this for consideration by the authors.
Answer - Following both reviewers' advice, we have decided to remove the "pizza example" altogether. Nevertheless, in the short section 4.1, the reader is pointed to the tutorial based on this general use case and available in the GitLab repository. In the revised paper, the main example in section 4.2 focuses on a CFD application (airplane wing) where the code NSMB is employed.

- Robustness of the method: I think the authors should discuss how reliant the tool is on standardized file formats. Considering the pizza example: what would happen if the format/structure and strings on the menu change? Assuming the underlying process and its file formats frequently change, the user would have to adapt the regex patterns every time to instruct the parser and ensure proper data capture. If I understood correctly, every process to be tracked by the tool requires a-priori standardization and the user must provide proper instructions to help the parser retrieve the required information. I could imagine that there are applications where this is non-trivial and consumes a considerable amount of time.
Answer - A paragraph addressing this matter has been added at the end of section 4, and reads as follows: Regarding the limitations of HOMER, the tool works best within standardized workflows, as common in CFD investigations, where the structure of the files containing the metadata to be extracted changes very little or not at all. Although, as shown, it would be possible to adapt the crawler, and in particular the multiplexed dictionary, to new file structures thanks to the flexibility of the tool, such an adaptation could take a considerable amount of time and effort on the user's side if performed for every new application of a (changing) workflow. Hence, it appears sensible to limit the use of the crawler to cases where well-known and relatively fixed data structures are employed, as is common in most numerical and experimental research projects. The second limitation is the range of data formats the crawler can currently extract metadata from, which is limited to text and HDF5 files, together with the outputs of terminal commands and hardcoded lines. Although the regular-expression parser allows information to be retrieved from virtually any text file regardless of its extension, commonly used formats such as .xml have not been implemented yet. As the crawler is designed to be flexible, this would be a straightforward process.
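The following minimal sketch illustrates the kind of regular-expression extraction discussed here, and why a changed log format forces the user to update the pattern by hand; the log lines and the pattern are invented for the example and do not come from NSMB or from HOMER itself.

```python
import re

# Invented log format and pattern -- not from NSMB or HOMER.
log_text = """\
Iteration  1000  residual = 3.42e-05
Iteration  2000  residual = 8.17e-06
"""

pattern = re.compile(r"Iteration\s+(\d+)\s+residual\s*=\s*([\d.eE+-]+)")
records = [(int(it), float(res)) for it, res in pattern.findall(log_text)]
print(records)  # [(1000, 3.42e-05), (2000, 8.17e-06)]

# If the solver changes its column layout or wording, this pattern must be
# updated by hand -- the robustness limitation discussed above.
```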

Minor comments:
- General: HOMER is an abbreviation for "HPMC tool for Ontology-based Metadata Extraction and Re-use" - explaining one abbreviation (HOMER) with another (HPMC) is inconvenient - I had to look up the term HPMC before understanding what HOMER actually means. I suggest avoiding secondary abbreviations in the explanation, even if this makes the explanation longer when introducing it for the first time.
Answer - This aspect has now been corrected in the text. The first time the name HOMER is explained, in the abstract, the full phrase "High Performance Measurement and Computing" is used in place of the acronym "HPMC". Moreover, the subtitle of the paper no longer contains the direct explanation of the acronym "HOMER", but rather a general explanation of the code.

- line 5/6: "Here comes Research Data Management to provide an efficient solution to these problems." - perhaps a bit picky, but "also bad research data management is research data management". Every scientist has a management strategy, whether it was consciously developed or not. Whether a specific RDM is actually "an efficient solution" depends on its details. Perhaps rephrase.
Answer - The paragraph has been rephrased as follows: Although every researcher implements some sort of Research Data Management (RDM), either consciously or unconsciously, to avoid the loss of precious information, standardized RDM approaches, such as the FAIR data principles (Findable, Accessible, Interoperable, and Re-usable), have been proposed in order to provide a more structured and potentially efficient solution to these problems.

- lines 62/63: both Docker and Globus should have references.
Answer - The references to Docker [3] and Globus [4, 5] have been added.

- terminology: what is the conceptual difference between a "sharable dataspace" (page 2, line 48) and "very large data lakes" (page 3, line 57)? This is not explicitly mentioned in the paper and I doubt this is common knowledge. My best guess is that the authors adopt terminologies introduced by the authors of the reviewed tools (signac, Xtract), but I think they should aim for a consistent terminology throughout their own work.
Answer - The two terms are now explained in the revised paper: ... shareable dataspace (a decentralized infrastructure for data sharing and exchange based on commonly agreed principles [1]) ... data lakes (centralized systems storing data in raw format [2]). Moreover, we have tried to improve the usage of the terminology in the rest of the paper, as well.

- Fig. 1: what is a "flat class"? This warrants further explanation.
Answer - A flat class is the prototype or template of a class. The explanation of the term has been added to the text in section 3.2 as follows: The first step consists of reading the ontology file and creating an empty dictionary containing the flat classes as specified in the ontology. A "flat class" in this context is the initial state of a class in which the properties have not yet been specified.
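A rough sketch of what such a first step could look like is given below, using the rdflib library to list the OWL classes of an ontology and to create one empty ("flat") entry per class; this is an assumption-laden illustration, not HOMER's actual code, and the ontology file name and class names are hypothetical.

```python
from rdflib import Graph
from rdflib.namespace import RDF, OWL

# Read an ontology and build an empty dictionary of "flat classes".
g = Graph()
g.parse("ontology.owl", format="xml")  # hypothetical ontology file

flat_classes = {}
for cls in g.subjects(RDF.type, OWL.Class):
    name = str(cls).rsplit("#", 1)[-1]   # strip the namespace part of the IRI
    flat_classes[name] = {}              # "flat": no properties specified yet

print(flat_classes)  # e.g. {"Method": {}, "Tool": {}, "ObjectOfResearch": {}}
```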

- line 123: what is "edge mode (where the data are generated)" - (1) data has no plural, "data is generated", (2) I still do not understand what "edge mode" means; this warrants further explanation.
Answer - About comment (1), please see the next point. Regarding point (2), in the revised paper, both terms "edge mode" and "central mode" (mentioned in the second-to-last comment) have been more clearly explained in section 1 as follows: With Xtract, metadata can be extracted both centrally ("central mode", i.e. fetching the metadata all at once from the different repositories where the data are generated) and at the edge ("edge mode", i.e. extracting and storing the metadata as soon as the data is created and at the location where it is generated).

- line 199: "metadata are extracted" should read "metadata is extracted" (see also previous comment) - please correct throughout the text.
Answer - In this work, the words data and metadata are used in the plural form as they refer to ”pieces of information” (plural of datum) rather than to ”information” as a collective noun. Since the crawler actually retrieves multiple data, we have decided not to implement this comment.

- lines 201-211 represent file contents of the pizza menu (this is the menu, correct? not fully clear from the upstream text). For this and all the following occurrences of file contents, I suggest using a separate code environment which is formatted separately and also has a caption.
Answer - The pizza example has been removed and has been replaced by the CFD example, which follows the same structure. Hence, the modifications were done on that part of the article. The lines of code are already written in a separate environment with its own formatting (verbatim) and can also be identified by the grey line numbers on the side (in the review version these are covered by the line numbers in purple). We have now added a caption to all the listings in the article to better separate the code environment from the rest of the text.

- also the term "centralized mode" warrants further explanation.
Answer - For this comment, refer to the answer about the meaning of "edge mode".

- line 390, typo: ”code is designed”.
Answer - The typo has been corrected.

- It is somewhat uncommon to introduce additional figures (i.e. Fig. 2) in the conclusion section. To me, Fig. 2 would have been better presented when describing the code's characteristics, Sec. 2.1.
Answer - The figure has been moved upstream to section 3.1 (Code’s characteristics).

---- Answer to Reviewer 2
Major remarks from reviewer 2:
- The paper is presented more as a technical paper than a research one. It would have added value if the authors had presented the research significance of the application of semantics through tools such as HOMER to data generated through HPMC systems, rather than explaining Python script models in detail.
Answer - In our opinion, the structure and the tone of the paper are in line with the focus and scope of the ing.grid journal, which includes submissions of manuscripts and tutorials as well as research software. In general, we want to show the benefits of standardized metadata extraction using an exemplary use case. The advantages and the added value are described in more detail in the revised edition (section 1): One major factor in making data FAIR is a controlled vocabulary / common terminology. The use of a controlled vocabulary is essential for findability, interoperability, and consequently, the re-use and the establishment of new user models. Most of the research data in the HPMC domain is neither documented nor are metadata sets available, as common terminologies for HPMC in the engineering sector still need to be developed and established within the community. HOMER allows the automatic retrieval of relevant research metadata from script-based workflows on HPC systems and therefore supports researchers in collecting and publishing their research data within a controlled vocabulary using a standardized workflow.

- The paper presents work involving multiple disciplines. Therefore, the authors should take extra care when introducing technical terminology and should explain technical terms as simply as possible before their usage in the paper. For example, the term "ontology", which appears abruptly in line 19, might be new to experts in the field of fluid mechanics, while the abbreviation CFD might be new to semantics experts. The authors should balance their usage not just by defining such terms but also by explaining how they appear in the overall picture of the research work. Moreover, in places the explanations provided are misleading: ontologies are generally not developed just to define controlled vocabularies; they are developed to provide a semantic background to data. Controlled vocabularies present the semantics behind the vocabulary through knowledge organization systems (KOS) such as the Simple Knowledge Organization System (SKOS). Such KOS can be presented through ontologies.
Answer - A definition and the overall context have been included in the revised paper in section 1: An ontology defines a shared conceptualization of a common vocabulary, semantic relations of data, and syntactic as well as semantic interoperability, including machine-interpretable definitions of basic concepts in the domain and the relations among them.

- Abbreviations should be expanded before their further usage in the paper: CFD (Computational Fluid Dynamics) workflows at line 30 should have been expanded first but are only expanded later, on lines 166/167. The same goes for RDM (Research Data Management, line 40). Likewise, Metadata4Ing (Metadata for Ingenieur [Engineers], line 23) should have been explained in some detail.
Answer - A stringent usage of abbreviations and further details regarding the Metadata4Ing sub-ontology have been included in the revised paper: The expansion to an HPC sub-ontology is based on modularity and fits into the primary Metadata4Ing classes of method, tool, and object of research. The expansion includes suggestions of unambiguous terms for domain-related metadata, expressed in classes, object properties (relations) and data properties. These classes have been developed in a community-based approach and represent common methods and tools for workflows in engineering research on HPMC systems.

- The authors should use phrases to describe technical terms as far as possible. For example, the sentence "The code has been developed as a collection of routines, each performing a different action, rather than as a single script." within section 2.1, lines 113/114, could have been explained through the phrase "modular routines". The phrase, however, is only used in line 117 of the section.
Answer - Wherever possible, phrases have been used to explain concepts, at the same time trying to avoid unnecessary repetitions. Regarding the ”modular routines”, the text has been amended as follows: The code has been developed as a collection of modular routines, each performing...

Included here are some other inquiries and suggestions:
· The three keywords "path", "type", and "pattern" cannot be seen in the actual HOMER implementation section. Therefore, are these keywords specifically meant for the Pizza example?
Answer - The three keywords are always part of the .json file created after the "Multiplexer" step. They did not appear in the original short CFD example because we only extracted metadata there. However, in the revised paper, the keywords are used directly in the CFD case (see Listings 3 and 4).
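To make this concrete, here is a sketch of what one entry of such a multiplexed .json file could look like. Only the three keywords "path", "type" and "pattern" are taken from the paper; the metadata name, file path, and regex below are hypothetical.

```python
import json

# Hypothetical entry of a multiplexed dictionary; only the keywords
# "path", "type" and "pattern" are taken from the paper.
multiplexed_entry = {
    "mach_number": {
        "path": "run_001/solver.log",        # file the crawler should read
        "type": "text",                      # which parser to use
        "pattern": r"Mach\s*=\s*([\d.]+)",   # regex handed to the text parser
    }
}
print(json.dumps(multiplexed_entry, indent=2))
```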

· Do the sentences between lines 160 and 162 ("This involves……simulation code is run again.") mean there is a learning mechanism that trains the parameters for the configuration file for later?
Answer - A learning mechanism is not employed in the process. In the revised paper, the paragraph has been re-phrased for better clarity as follows: However, once the first setup has been completed, the configuration file (created at the end of step four) can be re-used with little to no modification every time a new simulation is performed by the user and new metadata need to be extracted. Specifically, step five simply needs to be performed in the subsequent runs.

· In line 170 it should be "aeroplane's", not "airplane's".
Answer - In the aerodynamics community, "airplane" is the common term even in British English usage, and we have kept it in accordance with the common use in the specific discipline.

· Is there any specific reason behind choosing Metadata4Ing? Why is the broadly used provenance ontology PROV-O not used as an example? Most users might be able to relate to PROV-O, as it is standardized by the W3C, and not to Metadata4Ing. The M4I ontology is a rather new one.
Answer - Metadata4Ing has been chosen as the exemplary ontology because it is developed within the NFDI consortium NFDI4Ing with the aim of providing a thorough framework for the semantic description of research data, with a particular focus on engineering sciences and neighboring disciplines. Metadata4Ing re-uses elements from existing ontologies, including the PROV (Provenance Namespace) ontology, whose terms were imported into Metadata4Ing. We use Metadata4Ing to show the functionality of the crawler, but the workflow is generally transferable to other ontologies.

· What is the impact of the choice of another ontology (e.g., PROV-O) on the five steps of the HOMER process?
Answer - The only difference would be in the dictionary containing the flat classes, which is created at the end of step 1. Since a different ontology will contain different classes and properties, the list in the .json file would be different. All the following steps will remain the same as described in the paper. This aspect emerges more clearly now in the discussion at the end of section 4: Another remark is that the user doesn't need to adhere strictly to the chosen ontology file, nor does the user have to use an ontology based on Metadata4Ing.

· Is the "Pizza Ontology" the same ontology provided by Stanford University in the Protégé tutorial?
Answer - Yes, the pizza ontology is a simplified ontology, for showcasing purposes only, based on the Protégé tutorial. This is now clearly stated in the text, and the reference [6] has been added as well.

· Is "Vegetarian Pizza" the same as "White Pizza"? It is a bit confusing here because vegetarian pizza is never mentioned as one consumed by the three users.
Answer - The "pizza-ontology" example has now been removed from the article, hence a clarification is no longer needed. In any case, the "vegetarian pizza" was a class extracted from the original .owl file and then not used in the example. This was, however, not explained in the original version of the paper.

· In lines 279/280, it is mentioned that the (meta)data are extracted from plain text files. Will this be possible if the data are available online?
Answer - At the moment, this feature is not available, but it might be included in future updates of the tool.

· The text in lines 325 to 326 is confusing. What do the authors mean by intended use: is it meant for different days, different numbers of pizzas, or different numbers of users?
Answer - The "pizza-ontology" example has now been removed from the article, hence a clarification is no longer needed. In any case, by "intended use" we meant the usage of the crawler either in central mode or in edge mode. These two usages were then explained in the remaining part of the paragraph (lines 327-336).

· What are menus 1-6 in line 327?
Answer - The "pizza-ontology" example has now been removed from the article, hence a clarification is no longer needed. In any case, menus 1-6 referred to the six different files, each containing a different variant of the menu file, that can be found in the online repository of the pizza example.

· Section 3.2 "Simple CFD-like application" should provide actual insight into the workflow, but the section is too short and not adequately explained. The authors should prioritize this section as the main engineering application of the HOMER tool.
Answer - Following both reviewers' advice, we have decided to remove the "pizza example" altogether. Nevertheless, in the short section 4.1, the reader is pointed to the tutorial based on this general use case and available in the GitLab repository. The main example now focuses on a CFD application (airplane wing) where the code NSMB is employed.

· Is it necessary for the users to have an ontology file? What will happen if the user does not have one?
Answer - Strictly speaking, it is not necessary to have an ontology file. This is now explained at the end of section 4 as follows: In fact, an ontology file is, in principle, not even necessary for the actual extraction of the metadata. The user could even create their own .json file with their own classes and properties, skipping the first two steps altogether.

· What are those further user inputs required for centralized mode?
Answer - This has been clarified in the revised text as follows: The other option would be to use the crawler after all the simulations have been run in order to retrieve all the metadata at once, which corresponds to using the crawler in central mode. In this case, the user will need to specify absolute file paths (via the keyword "path" in the multiplexed .json file) to point to the files containing the information. This means that the user needs to create an extra script that allows the crawler to search all the relevant folders and files. Such a script would be specialized according to the user’s simulation environment and workflow. Hence, no example of such a usage can be given in the context of the generic CFD-showcase described in this work. However, an example script is provided in the GitLab folder for the Pizza-ontology tutorial.
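Purely as a generic illustration of what such a path-collecting helper script might look like (the folder layout under /scratch and the file name solver.log are invented for this sketch; the actual script would depend on the user's simulation environment):

```python
from pathlib import Path

# Collect absolute paths of candidate files across many run folders, so the
# crawler can fetch all metadata at once in central mode. Folder layout and
# file names are assumptions made for this sketch, not part of HOMER.
base = Path("/scratch/project/simulations")
log_paths = sorted(p.resolve() for p in base.glob("run_*/solver.log"))

for path in log_paths:
    print(path)  # these absolute paths would fill the "path" keywords
```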

· It would help if the figure is on the same page as it is mentioned.
Answer - The figures have been rearranged so that each appears on the same (or the following) page as where it is mentioned.

· Is it true that the users have to develop an ontology based on a standardized ontology like M4I?
Answer - It is not required to develop an ontology like M4I. In the revised paper, we have added the following paragraph in section 4: Another remark is that the user doesn't need to adhere strictly to the chosen ontology file, nor does the user have to use an ontology based on Metadata4Ing. At any of the steps where manual input is needed, the user can adjust the classes and properties according to the case-specific needs. ... In fact, an ontology file is, in principle, not even necessary for the actual extraction of the metadata. The user could even create their own .json file with their own classes and properties, skipping the first two steps altogether.

---- References
[1] Nagel, L., Lycklama, D. (2021). "Design Principles for Data Spaces." Position Paper, Version 1.0, Berlin. DOI: 10.5281/zenodo.5105744

[2] Dixon, J. (2010). "Pentaho, Hadoop, and Data Lakes." Available online at: https://jamesdixon.wordpress.com/2010/10/14/pentaho-hadoop-and-data-lakes/

[3] Merkel, D. (2014). "Docker: Lightweight Linux Containers for Consistent Development and Deployment." Linux Journal, vol. 2014, no. 239, article 2. DOI: 10.5555/2600239.2600241

[4] Foster, I. (2011). "Globus Online: Accelerating and Democratizing Science through Cloud-Based Services." IEEE Internet Computing, vol. 15, no. 3, pp. 70-73. DOI: 10.1109/MIC.2011.64

[5] Allen, B., Bresnahan, J., Childers, L., Foster, I., Kandaswamy, G., Kettimuthu, R., Kordas, J., Link, M., Martin, S., Pickett, K., Tuecke, S. (2012). "Software as a Service for Data Scientists." Communications of the ACM, vol. 55, no. 2, pp. 81-88. DOI: 10.1145/2076450.2076468

[6] Musen, M. (2015). "The Protégé Project: A Look Back and a Look Forward." AI Matters, vol. 1, no. 4, pp. 4-12. DOI: 10.1145/2757001.2757003

Comment #91 Kevin Logan @ 2024-02-05 09:55

As managing editor of ing.grid, I thank the reviewers for their conscientious and detailed reviews. In accordance with their recommendations, I advise the authors to revise the manuscript incorporating the changes suggested by the reviewers. The authors are expected to submit a detailed response to the reviewers, in which they make clear how they considered the reviewers' comments.

Invited Review Comment #87 Anonymous @ 2024-01-25 16:05

The authors present a metadata crawler tool called HOMER for script-based applications running on HPC systems. Their objective is a method for automatic retrieval of metadata. The tool is linked to an ontology file for appropriate terminology. The work originates from a project embedded in NFDI4ING, archetype DORIS.

Overall, the manuscript is well written and easy to follow, even though I think the overall structure of the paper needs a revision (see comments below). The topic of metadata retrieval is a very relevant question for research data management. Before I can recommend the paper for publication, the authors should consider the following major comments:

- Strengthen objective: In the introduction, the authors mention many other relevant metadata crawlers or extraction tools. Together with a brief description of the respective tool's capabilities, the authors also mention limitations or shortcomings. When introducing their own tool, HOMER, the authors mention what the tool does. Moreover, in my view, the authors should strengthen their objective and specifically highlight what differentiates HOMER from the other previously mentioned tools (e.g. iCurate, signac, Xtract, ExtractIng, etc.). In other words: is there a difference in scope, why is this tool required, or why did they not use one of the other existing tools?

- Improve paper structure: In my view, the authors are describing a solution, i.e. the tool HOMER, before characterizing the problem in detail. This structure makes this paper difficult to read at times. It would be very beneficial for readers to (1) first illustrate a sample workflow without HOMER, (2) then highlight the critical points from an RDM perspective, and (3) finally illustrate the solution/improvement with HOMER. At the moment, aspects and reasoning about the tool remain somewhat abstract and the authors need to point to aspects discussed in downstream sections (not ideal) before it starts to make sense after Sec. 3.2 (too late!). This would also allow the authors to address how the user's workflow changes, i.e. which overhead is caused by using the tool, in comparison to the original (sub-optimal) workflow. I think the paper has all of the components it needs, but I strongly suggest reordering them to improve readability.

- Choice of the example(s): In Sec. 3 the authors first introduce the example of the simulation of an airplane wing, which is briefly discussed, but then dropped (why?) in favor of a "pizza example", before coming back to it in Sec. 3.2. Given the scope of the journal for the engineering sciences, I would have strongly preferred the illustration of the tool's capabilities with the wing simulation only (because: relevance). Interpreting persons as "processing steps" (line 197) and menu properties as "computational variables" appears overly complicated, if the alternative example can directly be the wing simulation process with parameters engineers readily relate to. I think the pizza example is redundant and can be omitted in favor of a more detailed CFD example. This is just an opinion, but I wanted to mention this for consideration by the authors.

- Robustness of the method: I think the authors should discuss how reliant the tool is on standardized file formats. Considering the pizza example: what would happen if the format/structure and strings on the menu change? Assuming the underlying process and its file formats frequently change, the user would have to adapt the regex patterns every time to instruct the parser and ensure proper data capture. If I understood correctly, every process to be tracked by the tool requires a-priori standardization and the user must provide proper instructions to help the parser retrieve the required information. I could imagine that there are applications where this is non-trivial and consumes a considerable amount of time.

Minor comments:

- General: HOMER is an abbreviation for "HPMC tool for Ontology-based Metadata Extraction and Re-use" - explaining one abbreviation (HOMER) with another (HPMC) is inconvenient - I had to look up the term HPMC before understanding what HOMER actually means. I suggest avoiding secondary abbreviations in the explanation, even if this makes the explanation longer when introducing it for the first time.

- line 5/6: "Here comes Research Data Management to provide an efficient solution to these problems." - perhaps a bit picky, but "also bad research data management is research data management". Every scientist has a management strategy, whether it was consciously developed or not. Whether a specific RDM is actually "an efficient solution" depends on its details. Perhaps rephrase.

- lines 62/63: both Docker and Globus should have references.

- terminology: what is the conceptual difference between a "sharable dataspace" (page 2, line 48) and "very large data lakes" (page 3, line 57)? This is not explicitly mentioned in the paper and I doubt this is common knowledge. My best guess is that the authors adopt terminologies introduced by the authors of the reviewed tools (signac, Xtract), but I think they should aim for a consistent terminology throughout their own work. 

- Fig. 1: what is a "flat class"? This warrants further explanation.

- line 123: what is "edge mode (where the data are generated)" - (1) data has no plural, "data is generated", (2) I still do not understand what "edge mode" means; this warrants further explanation.

- line 199: "metadata are extracted" should read "metadata is extracted" (see also previous comment) - please correct throughout the text.

- lines 201-211 represent file contents of the pizza menu (this is the menu, correct? not fully clear from the upstream text). For this and all the following occurrences of file contents, I suggest using a separate code environment which is formatted separately and also has a caption.

- also the term "centralized mode" warrants further explanation.

- line 390, typo: "code is designed".

- It is somewhat uncommon to introduce additional figures (i.e. Fig. 2) in the conclusion section. To me, Fig. 2 would have been better presented when describing the code's characteristics, Sec. 2.1.

Invited Review Comment #85 Anonymous @ 2024-01-19 05:39

The authors of the paper "From Ontology to Metadata: A Crawler for Script-based Workflows; HOMER: an HPMC tool for Ontology-based Metadata Extraction and Re-use" present an approach to extract and use metadata within the workflows of High Performance Computing systems. The work presented in the paper is relevant in the current context of data explosion, the complexity of managing this huge amount of data, and the resulting loss of track of data and hence their re-generation from scratch. The authors provide good arguments on the necessity of a tool capable of understanding the semantics behind the data and their metadata. However, the paper is not an easy and smooth read; there are jumps that make the reading difficult in places.

The paper is presented more as a technical paper than a research one. It would have added value if the authors had presented the research significance of the application of semantics through tools such as HOMER to data generated through HPMC systems, rather than explaining Python script models in detail.

The paper presents work involving multiple disciplines. Therefore, the authors should take extra care when introducing technical terminology and should explain technical terms as simply as possible before their usage in the paper. For example, the term "ontology", which appears abruptly in line 19, might be new to experts in the field of fluid mechanics, while the abbreviation CFD might be new to semantics experts. The authors should balance their usage not just by defining such terms but also by explaining how they appear in the overall picture of the research work. Moreover, in places the explanations provided are misleading: ontologies are generally not developed just to define controlled vocabularies; they are developed to provide a semantic background to data. Controlled vocabularies present the semantics behind the vocabulary through knowledge organization systems (KOS) such as the Simple Knowledge Organization System (SKOS). Such KOS can be presented through ontologies.

Abbreviations should be expanded before their further usage in the paper: CFD (Computational Fluid Dynamics) workflows at line 30 should have been expanded first but are only expanded later, on lines 166/167. The same goes for RDM (Research Data Management, line 40). Likewise, Metadata4Ing (Metadata for Ingenieur [Engineers], line 23) should have been explained in some detail.

The authors should use phrases to describe technical terms as far as possible. For example, the sentence "The code has been developed as a collection of routines, each performing a different action, rather than as a single script." within section 2.1, lines 113/114, could have been explained through the phrase "modular routines". The phrase, however, is only used in line 117 of the section.

Included here are some other inquiries and suggestions:

  • The three keywords "path", "type", and "pattern" cannot be seen in the actual HOMER implementation section. Therefore, are these keywords specifically meant for the Pizza example?
  • Do the sentences between lines 160 and 162 ("This involves……simulation code is run again.") mean there is a learning mechanism that trains the parameters for the configuration file for later?
  • In line 170 it should be "aeroplane's", not "airplane's".
  • Is there any specific reason behind choosing Metadata4Ing? Why is the broadly used provenance ontology PROV-O not used as an example? Most users might be able to relate to PROV-O, as it is standardized by the W3C, and not to Metadata4Ing. The M4I ontology is a rather new one.
  • What is the impact of the choice of another ontology (e.g., PROV-O) on the five steps of the HOMER process?
  • Is the "Pizza Ontology" the same ontology provided by Stanford University in the Protégé tutorial?
  • Is "Vegetarian Pizza" the same as "White Pizza"? It is a bit confusing here because vegetarian pizza is never mentioned as one consumed by the three users.
  • In lines 279/280, it is mentioned that the (meta)data are extracted from plain text files. Will this be possible if the data are available online?
  • The text in lines 325 to 326 is confusing. What do the authors mean by intended use: is it meant for different days, different numbers of pizzas, or different numbers of users?
  • What are menus 1-6 in line 327?
  • Section 3.2 "Simple CFD-like application" should provide actual insight into the workflow, but the section is too short and not adequately explained. The authors should prioritize this section as the main engineering application of the HOMER tool.
  • Is it necessary for the users to have an ontology file? What will happen if the user does not have one?
  • What are the further user inputs required for centralized mode?
  • It would help if the figure is on the same page as it is mentioned.
  • Is it true that the users have to develop an ontology based on a standardized ontology like M4I?

Metadata
  • Published: 2023-04-27
  • Last Updated: 2024-04-22
  • License: Creative Commons Attribution 4.0
  • Subjects: Data Management Software
  • Keywords: Metadata extraction, HPMC, Research Data Management, Ontology