From Ontology to Metadata: A Crawler for Script-based Workflows

This is a Preprint and has not been peer reviewed. This is version 1 of this Preprint.

Authors

Giuseppe Chiapparino, Benjamin Farnbacher, Nils Hoppe, Radoslav Ralev, Vasiliki Sdralia, Christian Stemmer

Abstract

The present work introduces HOMER (HPMC tool for Ontology-based Metadata Extraction and Re-use), a Python-based metadata crawler that automatically retrieves relevant research metadata from script-based workflows on HPC systems. The tool offers a flexible approach to metadata collection, as the metadata schema is read from an ontology file. With minimal user input, the crawler can be adapted to the user's needs and easily integrated into the workflow to retrieve the relevant metadata. The obtained information can be post-processed automatically: for example, strings may be trimmed by regular expressions and numerical values may be averaged. Currently, data can be collected from text files and HDF5 files, or hardcoded directly by the user. Moreover, the tool has been designed in a modular way that allows straightforward extension of the supported file types, the instruction-processing routines and the post-processing operations.
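The extraction-plus-post-processing idea described in the abstract can be sketched roughly as follows. This is a minimal illustration only: the instruction keys ("path", "type", "pattern"), the `extract` function and the log format are invented for this sketch and are not HOMER's actual API.

```python
import re
import statistics

def extract(instruction, post_process=None):
    """Pull all regex matches for one metadata key from a text file.

    `instruction` is a hypothetical dict with "path", "type" and
    "pattern" entries, mimicking the kind of user-provided parsing
    instruction the abstract describes."""
    with open(instruction["path"]) as fh:
        values = re.findall(instruction["pattern"], fh.read())
    return post_process(values) if post_process else values

# Invented example input: a small solver log with per-step residuals.
with open("solver.log", "w") as fh:
    fh.write("step 1 residual = 0.50\nstep 2 residual = 0.30\n")

instruction = {"path": "solver.log", "type": "text",
               "pattern": r"residual = ([0-9.]+)"}

# Post-processing step: average the captured numerical values.
mean_residual = extract(instruction,
                        lambda v: statistics.mean(map(float, v)))
print(mean_residual)  # 0.4
```

The point of the sketch is the separation of concerns the abstract claims: the parsing instruction is data, not code, so supporting a new file type or post-processing operation only requires adding a new routine rather than rewriting the crawler.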

Comments

Comment #91 Kevin Logan @ 2024-02-05 09:55

As managing editor of ing.grid, I thank the reviewers for their conscientious and detailed reviews. In accordance with their recommendations, I advise the authors to revise the manuscript incorporating the changes suggested by the reviewers. The authors are expected to submit a detailed response to the reviewers, in which they make clear how they considered the reviewers' comments.

Invited Review Comment #87 Anonymous @ 2024-01-25 16:05

The authors present a metadata crawler tool called HOMER for script-based applications running on HPC systems. Their objective is a method for automatic retrieval of metadata. The tool is linked to an ontology file for appropriate terminology. The work originates from a project embedded in NFDI4ING, archetype DORIS.

Overall, the manuscript is well written and easy to follow, even though I think the overall structure of the paper needs a revision (see comments below). The topic of metadata retrieval is a very relevant question for research data management. Before I can recommend the paper for publication, the authors should consider the following major comments:

- Strengthen objective: In the introduction, the authors mention many other relevant metadata crawlers or extraction tools. Together with a brief description of each tool's capabilities, the authors also mention limitations or shortcomings. When introducing their own tool, HOMER, the authors mention what the tool does. However, in my view, the authors should strengthen their objective and specifically highlight what differentiates HOMER from the other previously mentioned tools (e.g. iCurate, signac, Xtract, ExtractIng, etc.). In other words: is there a difference in scope, why is this tool required, or why did they not use one of the other existing tools?

- Improve paper structure: In my view, the authors describe a solution, i.e. the tool HOMER, before characterizing the problem in detail. This structure makes the paper difficult to read at times. It would be very beneficial for readers to (1) first illustrate a sample workflow without HOMER, (2) then highlight the critical points from an RDM perspective, and (3) finally illustrate the solution/improvement with HOMER. At the moment, aspects of and reasoning about the tool remain somewhat abstract, and the authors need to point to aspects discussed in downstream sections (not ideal) before it starts to make sense after Sec. 3.2 (too late!). This would also allow the authors to address how the user's workflow changes, i.e. which overhead the tool introduces in comparison to the original (sub-optimal) workflow. I think the paper has all of the components it needs, but I strongly suggest reordering them to improve readability.

- Choice of the example(s): In Sec. 3 the authors first introduce the example of the simulation of an airplane wing, which is briefly discussed, but then dropped (why?) in favor of a "pizza example", before coming back to it in Sec. 3.2. Given the scope of the journal for the engineering sciences, I would have strongly preferred the illustration of the tool's capabilities with the wing simulation only (because: relevance). Interpreting persons as "processing steps" (line 197) and menu properties as "computational variables" appears overly complicated, if the alternative example can directly be the wing simulation process with parameters engineers readily relate to. I think the pizza example is redundant and can be omitted in favor of a more detailed CFD example. This is just an opinion, but I wanted to mention this for consideration by the authors.

- Robustness of the method: I think the authors should discuss how reliant the tool is on standardized file formats. Considering the pizza example: what would happen if the format/structure and strings on the menu change? Assuming the underlying process and its file formats change frequently, the user would have to adapt the regex patterns every time to instruct the parser and ensure proper data capture. If I understood correctly, every process to be tracked by the tool requires a priori standardization, and the user must provide proper instructions to help the parser retrieve the required information. I could imagine that there are applications where this is non-trivial and consumes a considerable amount of time.
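The fragility raised here can be made concrete with a small, hypothetical regex sketch; the menu strings and the pattern below are invented and not taken from the paper:

```python
import re

# A pattern written for one menu layout: "<name> Pizza  <price> EUR".
pattern = re.compile(r"(\w+ Pizza)\s+(\d+\.\d{2}) EUR")

old_menu = "Margherita Pizza  8.50 EUR"
new_menu = "Margherita Pizza  EUR 8.50"   # same information, new layout

# The same instruction silently stops matching after the layout change,
# so the user must notice the failure and rewrite the pattern.
print(bool(pattern.search(old_menu)))  # True
print(bool(pattern.search(new_menu)))  # False
```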

Minor comments:

- General: HOMER is an abbreviation for "HPMC tool for Ontology-based Metadata Extraction and Re-use" - explaining one abbreviation (HOMER) with another (HPMC) is inconvenient - I had to look up the term HPMC before understanding what HOMER actually means. I suggest avoiding secondary abbreviations in the explanation, even if this makes the explanation longer when introducing it for the first time.

- line 5/6: "Here comes Research Data Management to provide an efficient solution to these problems." - perhaps a bit picky, but "also bad research data management is research data management". Every scientist has a management strategy, whether it was consciously developed or not. If a specific RDM is actually "an efficient solution" depends on its details. Perhaps rephrase.

- lines 62/63: both Docker and Globus should have references.

- terminology: what is the conceptual difference between a "sharable dataspace" (page 2, line 48) and "very large data lakes" (page 3, line 57)? This is not explicitly mentioned in the paper and I doubt this is common knowledge. My best guess is that the authors adopt terminologies introduced by the authors of the reviewed tools (signac, Xtract), but I think they should aim for a consistent terminology throughout their own work. 

- Fig. 1: what is a "flat class"? This warrants further explanation.

- line 123: what is "edge mode (where the data are generated)" - (1) data has no plural, "data is generated", (2) I still do not understand what "edge mode" means, this warrants further explanation

- line 199: "metadata are extracted" should read "metadata is extracted" (see also previous comment) - please correct throughout the text.

- lines 201-211 represent the file contents of the pizza menu (this is the menu, correct? not fully clear from the upstream text). For this and all following occurrences of file contents, I suggest using a separate code environment which is formatted separately and also has a caption.

- also the term "centralized mode" warrants further explanation

- line 390, typo: "code is designed"

- It is somewhat uncommon to introduce additional figures (i.e. Fig. 2) in the conclusion section. To me, Fig. 2 would have been better presented when describing the code's characteristics, Sec. 2.1.

Invited Review Comment #85 Anonymous @ 2024-01-19 05:39

The authors of the paper "From Ontology to Metadata: A Crawler for Script-based Workflows; HOMER: an HPMC tool for Ontology-based Metadata Extraction and Re-use" present an approach to extract and use metadata within the workflows of High Performance Computing systems. The work presented in the paper is relevant in the current context of data explosion, the complexity of managing this huge amount of data, and the subsequent loss of track of it, which forces researchers to re-generate data from scratch. The authors provide good arguments on the necessity of a tool capable of understanding the semantics behind the data and their metadata. However, the paper is not an easy and smooth read: there are jumps in places which make the reading difficult.

The paper is presented more as a technical paper than a research one. It would have added value if the authors had presented the research significance of applying semantics through tools such as HOMER to data generated on HPMC systems, rather than explaining the Python script models in detail.

The paper presents work involving multiple disciplines. Therefore the authors should take extra care when introducing technical terminology, explaining technical terms as simply as possible before using them in the paper. For example, the term "ontology", which appears abruptly in line 19, might be new to experts in the field of fluid mechanics, while the abbreviation CFD might be new to semantics experts. The authors should balance these usages not just by defining the terms but also by explaining how they fit into the overall picture of the research work. Moreover, in places the explanations provided are misleading: ontologies are generally not developed just to define controlled vocabularies; they are developed to provide a semantic background to data. Controlled vocabularies present the semantics behind the vocabularies through knowledge organization systems (KOS) such as the Simple Knowledge Organization System (SKOS). Such KOS can be represented through ontologies.

Abbreviations should be expanded before their further usage in the paper: CFD (Computational Fluid Dynamics) workflows at line 30 should have been expanded at first occurrence but are only expanded later, on lines 166/167. The same goes for RDM (Research Data Management, line 40). Likewise, Metadata4Ing (Metadata for Ingenieure [Engineers], line 23) should have been explained in some detail.

The authors should use descriptive phrases for technical terms wherever possible. For example, the sentence "The code has been developed as a collection of routines, each performing a different action, rather than as a single script." in section 2.1, lines 113/114, could have been summarized by the phrase "modular routines". The phrase is, however, only used later, in line 117 of that section.

Included here are some other inquiries and suggestions:

  • The three keywords "path", "type", and "pattern" cannot be seen in the actual HOMER implementation section. Are these keywords therefore specifically meant for the pizza example?
  • Do the sentences between lines 160 and 162, "This involves……simulation code is run again.", mean there is a learning mechanism that trains the parameters of the configuration file for later runs?
  • In line 170 it should be aeroplane's, not airplane's.
  • Is there any specific reason behind choosing Metadata4Ing? Why is the broadly used provenance ontology PROV-O not used as an example? Most users might be able to relate to PROV-O, as it is standardized by the W3C, rather than to Metadata4Ing; the M4I ontology is a rather new one.
  • What is the impact of the choice of another ontology (e.g., PROV-O) on the five steps of the HOMER process?
  • Is the "Pizza Ontology" the same ontology provided by Stanford University in the Protégé tutorial?
  • Is "Vegetarian Pizza" the same as "White Pizza"? It is a bit confusing here, because vegetarian pizza is never mentioned as one consumed by the three users.
  • In lines 279/280 it is mentioned that the (meta)data is extracted from plain text files. Will this also be possible if the data are available online?
  • The text in lines 325 to 326 is confusing. What do the authors mean by intended use: is it meant for different days, different numbers of pizzas, or different numbers of users?
  • What are menus 1-6 in line 327?
  • Section 3.2, "Simple CFD like application", should provide actual insight into the workflow, but the section is too short and not adequately explained. The authors should prioritize this section as the main engineering application of the HOMER tool.
  • Is it necessary for the users to have an ontology file? What will happen if the user does not have one?
  • What are the further user inputs required for centralized mode?
  • It would help if each figure appeared on the same page on which it is mentioned.
  • Is it true that users have to develop their ontology based on a standardized ontology like M4I?

Metadata
  • Published: 2023-04-27
  • Last Updated: 2023-02-07
  • License: Creative Commons Attribution 4.0
  • Subjects: Data Management Software
  • Keywords: Metadata extraction, HPMC, Research Data Management, Ontology