Evaluation of tools for describing, reproducing and reusing scientific workflows

This is a Preprint and has not been peer reviewed. A published version of this Preprint is available on ing.grid. This is version 3 of this Preprint.

Authors

Philipp Diercks, Dennis Gläser, Ontje Lünsdorf, Michael Selzer, Bernd Flemisch, Jörg F. Unger

Abstract

In the field of computational science and engineering, workflows often entail the application of various software, for instance, for simulation or pre- and postprocessing. Typically, these components have to be combined in arbitrarily complex workflows to address a specific research question. In order for peer researchers to understand, reproduce and (re)use the findings of a scientific publication, several challenges have to be addressed. For instance, the employed workflow has to be automated and information on all used software must be available for a reproduction of the results. Moreover, the results must be traceable and the workflow documented and readable to allow for external verification and greater trust. In this paper, existing workflow management systems (WfMSs) are discussed regarding their suitability for describing, reproducing and reusing scientific workflows. To this end, a set of general requirements for WfMSs was deduced from user stories that we deem relevant in the domain of computational science and engineering. On the basis of an exemplary workflow implementation, publicly hosted at GitHub (https://github.com/BAMresearch/NFDI4IngScientificWorkflowRequirements), a selection of different WfMSs is compared with respect to these requirements, to support fellow scientists in identifying the WfMSs that best suit their requirements.

Comments

Comment #31 Achim Streit @ 2023-07-06 12:41

As the responsible topical editor, I would like to thank the authors for revising their paper and the two reviewers for their second review.

My final decision is: acceptance of the revised paper.

I would like to request the authors to upload the final version as a compressed LaTeX folder as described here: https://www.inggrid.org/site/editorguidelines/

Invited Review Comment #30 Anonymous @ 2023-07-05 09:23

Thanks to the authors for addressing all issues or explaining their approach / different opinion. I have no further comments.

Invited Review Comment #25 Ivan Kondov @ 2023-05-03 17:06

The authors have addressed all recommendations and raised issues, and have adapted/corrected the manuscript accordingly. I have nothing to add.

One comment regarding Answer 7 (referee #3). As in my original review, I do not insist on extending the set of user stories because this would shift the originally intended scope of the paper and would not necessarily improve it. Nevertheless, the suggested user stories show how strongly an evaluation study can depend on the choice of these stories. The user stories that I have suggested are from my own experience with HPC workflows. Regarding "..., we think that the modification of the workflow definition during the execution is very difficult to implement.": These two features, modifying individual steps and adding further steps after enactment and during workflow execution, have been implemented in FireWorks, Ref. 24 in the manuscript. Regarding "... steps that have already been completed could be modified and what should the WfMS then do?": The policy applied is to invalidate the outputs of the modified workflow steps and those of all dependent (child) workflow steps by archiving them; all these steps are then re-scheduled and re-launched. Obviously, outputs that are used as inputs elsewhere in the workflow may not be manually modified.
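The invalidation policy described above can be illustrated with a minimal Python sketch. It is not FireWorks code and does not use the FireWorks API; the function names, the step names and the dictionary-based workflow representation are all hypothetical and chosen only for illustration. The sketch collects the modified steps and all of their descendants, moves their existing outputs into an archive, and returns the steps that would have to be re-scheduled.

```python
from collections import deque


def steps_to_rerun(children, modified):
    """Return the modified steps plus all their descendants (breadth-first)."""
    seen = set(modified)
    queue = deque(modified)
    order = []
    while queue:
        step = queue.popleft()
        order.append(step)
        for child in children.get(step, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


def invalidate(children, outputs, archive, modified):
    """Archive the outputs of all affected steps and return those steps."""
    affected = steps_to_rerun(children, modified)
    for step in affected:
        if step in outputs:
            # Outputs are moved to the archive, not deleted.
            archive.setdefault(step, []).append(outputs.pop(step))
    return affected


# Hypothetical workflow: each key maps a step to its child steps.
children = {
    "mesh": ["solve"],
    "solve": ["postprocess", "count_dofs"],
    "postprocess": ["report"],
    "count_dofs": ["report"],
}
# Hypothetical outputs of the steps completed so far.
outputs = {
    "mesh": "mesh.msh",
    "solve": "solution.vtk",
    "postprocess": "plot.csv",
    "count_dofs": 1024,
}
archive = {}

rerun = invalidate(children, outputs, archive, ["solve"])
print(rerun)
print(outputs)
```

In this sketch, modifying the hypothetical "solve" step leaves the output of "mesh" untouched, while everything downstream is archived and marked for re-execution. The returned list is in discovery order only; a real scheduler would still have to launch the steps in an order that respects their dependencies.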

Comment #22 Dennis Gläser @ 2023-04-21 12:02

We would like to thank the referees for the time taken to carefully read the manuscript and for the many valuable comments that helped to improve it. The referees' comments, our answers and a description of the changes made to reflect those comments in the manuscript are given below. The changes are highlighted in color in the revised version of the manuscript. Minor changes regarding grammar or spelling are incorporated without listing them here.

Comments by referee #1: No points raised.

Comments by referee #2:

1. “Some links in the references did not work and should be checked (e.g. reference 20).” Response: The links in the references have been checked and should work now as intended.

2. “There is no statement of the contributions of the authors.” Response: The roles and contributions of the authors are stated in section 9 “Roles and contributions”.

3. “Only three user stories are taken into account for the definition of the requirements. Though these three user stories are quite different, it is not clear to me that they are able to take into account all varieties. As part of NFDI4Ing the authors should be able to get more user stories.” Response: The three user stories are based on the authors’ experience as stated in line 28. Therefore, these are of course subjective, but nevertheless designed to cover a wide range of aspects (requirements) as described in sections 2 and 3. After the initial design of the user stories, a survey regarding workflow management systems was conducted within a special interest group (within NFDI4Ing). At that time, all researchers could identify themselves with the three user stories. Therefore, we decided against including more user stories. (No changes made)

4. “In the introduction the authors state that they want to "discuss how WfMSs can contribute to the transparency, adaptability and reproducibility of computational research". The authors only evaluated the WfMS regarding the requirements they extracted from the user stories. The RDA working group FAIR4RS and other groups (some were even cited in the paper) also defined criteria for FAIR software or workflows. Why were these not taken into account to get a full view on the topic?” Response: We have expanded the introduction to make this point clearer and to distinguish this work from e.g. the FAIR4RS principles. In particular, we have outlined which aspects of the FAIR principles are addressed in this work, and have highlighted that, while the FAIR4RS principles are on a rather high level, this work investigates how specific features of WfMS may contribute to a more FAIR research software landscape. We have also elaborated this further in Section 3.

5. “The authors selected five WfMS - why were these selected? Are these the ones widely used worldwide in the computational science and engineering community?” Response: We expanded on this in the introductory paragraph of Section 5.

6. “The requirements list is a mixture of requirements for FAIRness and for "better user experience" (like graphical user interface, ease of first use). It would be better to clearly separate them in the evaluation - which of them contribute to "transparency, adaptability and reproducibility"?” Response: We expanded the introductory paragraph of section 3 with a general remark on how we think the discussed features of WfMS may contribute to transparent, reproducible and reusable computational workflows. Besides this, we added specific references to these aspects in the text of the individual requirements, where possible.

7. “The importance of the single requirements should be clearly stated. Maybe they should be weighted? ” Response: We made a conscious decision not to weight the different requirements, because the importance of each individual requirement strongly depends on the use case. We believe that the choice for one WfMS or the other is strongly subjective and depends on the particular application and the preferences of its developers (as stated in lines 613-615). By introducing such a weighting and a point scale, the readers are tempted to think that there is an overall winner among the tools, which we do not think is true, and therefore we did not introduce such a weighting. (No changes made)

Comments by referee #3:

1. “Two authors have to specify their ORCIDs according to the requirements.” Response: The missing ORCIDs were added.

2. “The locators of both data and software are the same. In addition, there should be a persistent identifier (such as DOI). Suggestion: Because the software is already in github it can be readily published in zenodo to obtain a DOI (and provide relevant metadata needed for citation)” Response: We published the current version of the repository via Zenodo with DOI https://doi.org/10.5281/zenodo.7790634. This information is included in a citation file in the GitHub repository, which is also linked in the Zenodo dataset. Given the size of the repository, we did not see the benefit of separating data and code and published the entire repository in a single dataset.

3. “The work is well motivated in the introduction. Because the paper occurs in the context of computational materials science, the authors might find it interesting to consider a recent work, aiming at comparison of different workflow management systems in energy storage and conversion applications” Response: A reference to the suggested work was added in the introduction.

4. “The sentences in lines 28-32 should include references to sections 2, 3 and 5. The use case (Section 4) should be mentioned / motivated in the introduction.” Response: The references were included and section 4 was mentioned as well.

5. “Figure 1: The workflow and the individual workflow process have only one input each. This is a substantial restriction for the workflow model. A workflow and a workflow step (called process in this paper) may have multiple inputs (see e.g. steps 5 and 6 in Figure 2). A similar specification is found in Figure 1 of Ref. 21. There should be a comment on this.” Response: We have added a comment to the caption of Figure 1, stating that each process (workflow step) as well as the workflow as a whole can have multiple inputs and outputs.

6. “Lines 121-122: Also in industrial research settings, the steep learning curve (the starting difficulties) is an issue.” Response: This was incorporated accordingly.

7. “One aspect not covered by the user stories, and closely related to the up-to-dateness and the reuse of the individual workflow components, is the possibility to modify a workflow that is already in execution. This is especially helpful because, as pointed out in the paper, HPC jobs take a long time to finish. Therefore, these two user stories: 1) The user does not want to discard completed steps that have been worth many thousands of core hours due to faulty or missing further steps discovered only after workflow enactment. 2) Even when a workflow has been completed, the user wants to extend an existing workflow later in order to reuse outputs from the individual steps.” Response: Thank you for that comment. While this is certainly an interesting idea, we think that the modification of the workflow definition during the execution is very difficult to implement. Parsing of the workflow definition script and execution are usually two subsequent steps and therefore this use case is (at least at the moment) out of scope in our opinion. For instance, the modifications made could be faulty, causing re-parsing to fail. Also, in principle steps that have already been completed could be modified and what should the WfMS then do? It is possible, however, to modify any input data (to a process that is not yet running) while the workflow is being executed. Data that is passed from previous steps cannot be modified, as it is the result of a computation. Also, the execution of the workflow could always be aborted, steps modified, and the workflow rerun from some intermediate result (if the WfMS supports recomputing the dependencies), which would cover at least the second point.

8. “Requirements 3.3, 3.10 and 3.11 are not explicitly supported by the user stories.” “Requirements 3.3 and 3.10 are related and should be more explicitly motivated by the user stories.” Response: The referenced requirements are now mentioned in the user stories and the relation between 3.3 and 3.10 is made clearer.

9. “The requirement 3.9 (Ease of first use) is very subjective and depends on the experience and skills of the user. Actually, a measure of these skills would be a better indicator. The "Ease of use" should be treated statistically, for example, by having a questionnaire answered by many users of the specific tool. Therefore, the ease of use found here for the six evaluated tools may not be a very valuable finding. In my opinion, this should be revised.” Response: Thank you for that valuable comment. We made the following changes to address this point. We stated in section 3.9 that the requirement is subjective, but we also attempted to make the criterion more meaningful by assuming that the user has basic programming skills. However, we did not attempt to revise this by conducting a survey of the users of the tools, which is out of scope of the present work. As you stated yourself, this could be a good topic for further research; accordingly, a comment was added in the outlook.

10. “Lines 617-629: The recommendations regarding ease of use are based on the experience of the authors. See my comment about Section 3.9. This has to be made very clear in the summary.” Response: We added a corresponding comment and reference to section 3.9. See also the previous item.

11. “A better measurable "ease of use" seems to be a good topic for a future paper.” Response: A corresponding paragraph was added in the outlook. See also the previous items related to this.

12. “The use case seems to be more like a use-case driven benchmark for evaluating predefined requirements than a real use case. These sentences support this: “This example is considered to be representative for many problems simulating physical processes in engineering science using numerical discretization techniques.”. “Most steps transfer data among each other via files, but we intentionally built in the transfer of the number of degrees of freedom as an integer value to check how well such a situation can be handled by the tools.”. Probably, the wording "use case" can be adapted or changed.” Response: The wording "simple use case" was adapted to "exemplary workflow".

13. “There is no comment on how the set of tools has been selected for evaluation. Having in mind the great number of existing tools (see for example this listing) the selection made by the authors should be justified. Alternatively a comment should be written to make clear that the evaluation is exemplary and is provided here as a method, and not intended to provide a representative picture of this complex landscape. ” Response: We have expanded on this in the introductory paragraph of Section 5.

14. “One general remark about the tool comparison. CWL and GWL are, strictly speaking, languages and not tools as such that can be compared to the others in the evaluation. A clear distinction should be made between domain specific languages and their supporting tools (such as editors, parsers, interpreters, engines etc.) throughout the paper. For example, in the GWL case, Guix is obviously the tool but in the CWL case, I cannot identify which tool has been evaluated. In lines 435-437, it is written that a number of different engines support CWL.” Response: We did not bind our evaluation to a specific tool (interpreter) in the case of CWL, because we think it is important that the variety of tooling around CWL is reflected in the evaluation. We have added an explanation on this in the first paragraph of Section 5.2. Moreover, we added a note in the section on GWL to incorporate the comment above.

15. “Lines 447-448: "Moreover, there exist workflow engines for CWL that support using job managers like e.g. Slurm". There should be a citation related to the CWL workflow engines supporting Slurm but only Slurm is cited.” Response: Citations of exemplary engines that support Slurm were added.

16. “Lines 477-486: Not clear for which requirement this discussion is relevant.” Response: This is relevant for the requirement "Process interfaces" and the text was changed to make this clearer.

17. “Table 1: In the definition of the requirements (lines 240-243) there were only two levels for data provenance but in Table 1 three levels are evaluated. This must be corrected” Response: The table was corrected accordingly.

18. “In Ref. 21 the DOI is missing and the source is hard to find without the DOI. The DOI is 10.5334/dsj-2022-016 and should be included.” Response: Thank you. The DOI was included.

19. “Refs. 20 and 31 contain links that are invalid (I guess inserted to escape the backslash in Latex). These DOIs and URLs must be corrected” Response: The links were corrected.

Comment #20 Achim Streit @ 2023-03-24 08:13

As the responsible topical editor, I would like to thank the three reviewers for their detailed and constructive feedback. After consideration of the comments, I advise the authors to revise the paper according to the suggestions provided in the reviews. After completion, the new version of the paper should be uploaded again and will be given again to the reviewers. Thank you.

Invited Review Comment #15 Ivan Kondov @ 2023-02-23 10:48

The authors suggest an original approach to compare workflow management systems (WfMSs) systematically based on user stories and a use case (a designed benchmark). The user stories have then been used to define requirements for WfMSs, also called “workflow tools” in sections 2-5. A set of workflow tools has been evaluated based on the use case. In the summary and the outlook, the authors discuss their findings from the evaluation and suggest further possible developments of workflow management systems, respectively.

The paper is very well written and understandable. Especially, I found the exposition of the user stories and the resulting requirements very interesting. One critical point is how the authors treat the ease-of-use issue (see below). In my opinion, the paper should undergo a minor revision. Here are my comments:

Front matter

Two authors have to specify their ORCIDs according to the requirements.

The locators of both data and software are the same. In addition, there should be a persistent identifier (such as DOI). Suggestion: Because the software is already in github it can be readily published in zenodo to obtain a DOI (and provide relevant metadata needed for citation).

Section 1

The work is well motivated in the introduction. Because the paper occurs in the context of computational materials science, the authors might find it interesting to consider a recent work, aiming at comparison of different workflow management systems in energy storage and conversion applications.

The sentences in lines 28-32 should include references to sections 2, 3 and 5. The use case (Section 4) should be mentioned / motivated in the introduction.

Section 1.1

Figure 1: The workflow and the individual workflow process have only one input each. This is a substantial restriction for the workflow model. A workflow and a workflow step (called process in this paper) may have multiple inputs (see e.g. steps 5 and 6 in Figure 2). A similar specification is found in Figure 1 of Ref. 21. There should be a comment on this.

Section 2

Lines 121-122: Also in industrial research settings, the steep learning curve (the starting difficulties) is an issue.

One aspect not covered by the user stories, and closely related to the up-to-dateness and the reuse of the individual workflow components, is the possibility to modify a workflow that is already in execution. This is especially helpful because, as pointed out in the paper, HPC jobs take a long time to finish. Therefore, these two user stories: 1) The user does not want to discard completed steps that have been worth many thousands of core hours due to faulty or missing further steps discovered only after workflow enactment. 2) Even when a workflow has been completed, the user wants to extend an existing workflow later in order to reuse outputs from the individual steps.

Section 3

Requirements 3.3, 3.10 and 3.11 are not explicitly supported by the user stories.

The requirement 3.9 (Ease of first use) is very subjective and depends on the experience and skills of the user. Actually, a measure of these skills would be a better indicator. The "Ease of use" should be treated statistically, for example, by having a questionnaire answered by many users of the specific tool. Therefore, the ease of use found here for the six evaluated tools may not be a very valuable finding. In my opinion, this should be revised.

Requirements 3.3 and 3.10 are related and should be more explicitly motivated by the user stories.

Lines 339-340: The text "in particular the" should be either "particularly in the" or "in particular in the" or "and in particular the".

Line 347: "paper someone" should be "paper that someone"

Section 4

The use case seems to be more like a use-case driven benchmark for evaluating predefined requirements than a real use case. These sentences support this: “This example is considered to be representative for many problems simulating physical processes in engineering science using numerical discretization techniques.”. “Most steps transfer data among each other via files, but we intentionally built in the transfer of the number of degrees of freedom as an integer value to check how well such a situation can be handled by the tools.”. Probably, the wording "use case" can be adapted or changed.

Section 5

There is no comment on how the set of tools has been selected for evaluation. Having in mind the great number of existing tools (see for example this listing) the selection made by the authors should be justified. Alternatively a comment should be written to make clear that the evaluation is exemplary and is provided here as a method, and not intended to provide a representative picture of this complex landscape.

One general remark about the tool comparison. CWL and GWL are, strictly speaking, languages and not tools as such that can be compared to the others in the evaluation. A clear distinction should be made between domain specific languages and their supporting tools (such as editors, parsers, interpreters, engines etc.) throughout the paper. For example, in the GWL case, Guix is obviously the tool but in the CWL case, I cannot identify which tool has been evaluated. In lines 435-437, it is written that a number of different engines support CWL.

Line 434: "parse the standard" should be "parse the input" or "support the CWL standard"

Lines 447-448: "Moreover, there exist workflow engines for CWL that support using job managers like e.g. Slurm". There should be a citation related to the CWL workflow engines supporting Slurm but only Slurm is cited.

Lines 477-486: Not clear for which requirement this discussion is relevant.

Table 1: In the definition of the requirements (lines 240-243) there were only two levels for data provenance but in Table 1 three levels are evaluated. This must be corrected.

Table 1: "Ease-of-first-use" should be "Ease of first use"

Section 6

Lines 617-629: The recommendations regarding ease of use are based on the experience of the authors. See my comment about Section 3.9. This has to be made very clear in the summary.

Section 7

Line 654: "challenges we have" should be "challenges that we have"

Line 683: "each of the tools allows" should be "all tools allow"

A better measurable "ease of use" seems to be a good topic for a future paper.

References

In Ref. 21 the DOI is missing and the source is hard to find without the DOI. The DOI is 10.5334/dsj-2022-016 and should be included.

Refs. 20 and 31 contain links that are invalid (I guess inserted to escape the backslash in Latex). These DOIs and URLs must be corrected.

Invited Review Comment #14 Anonymous @ 2023-01-31 09:44

Summary:
The authors present a study on workflow management systems (WfMS) for use in computational science and engineering. Their focus - according to the introduction - is on describing, reproducing and reusing scientific workflows as one aspect of FAIR research software. Based on user stories they define a set of requirements on workflow systems and compare existing solutions.

Strengths:
- Taking a start at this important topic. Workflows are common in science and engineering as one part of research software and thus need to contribute to the FAIR aspects.
- Defining requirements and evaluating existing solutions to help researchers in making their workflows FAIR.
- The definition of the requirements on the workflow systems is also based on user stories.
- The authors implemented an example workflow to experimentally test their requirement list on several WfMS.

Weaknesses:
- Only three user stories are taken into account for the definition of the requirements. Though these three user stories are quite different, it is not clear to me that they are able to take into account all varieties. As part of NFDI4Ing the authors should be able to get more user stories.
- In the introduction the authors state that they want to "discuss how WfMSs can contribute to the transparency, adaptability and reproducibility of computational research". The authors only evaluated the WfMS regarding the requirements they extracted from the user stories. The RDA working group FAIR4RS and other groups (some were even cited in the paper) also defined criteria for FAIR software or workflows. Why were these not taken into account to get a full view on the topic?
- The authors selected five WfMS - why were these selected? Are these the ones widely used worldwide in the computational science and engineering community?
- The requirements list is a mixture of requirements for FAIRness and for "better user experience" (like graphical user interface, ease of first use). It would be better to clearly separate them in the evaluation - which of them contribute to "transparency, adaptability and reproducibility"?
- The importance of the single requirements should be clearly stated. Maybe they should be weighted?

Writing issues:
The paper is easy to read and understandable.
Some links in the references did not work and should be checked (e.g. reference 20).
There is no statement of the contributions of the authors.

Conclusion:
The topic is of importance for FAIR research software and I think the authors are on the right track. Nevertheless, to be of more use for the community, some more work is needed. Thus I would reject the current state of the work but encourage the authors to come back with a refined paper.

Invited Review Comment #13 Anonymous @ 2023-01-20 10:09

This paper presents an excellent overview of the requirements and the practical problems associated with running scientific workflows in typical heterogeneous environments from the researcher's local machine up to cluster and HPC systems. The authors present a very practical approach to comparing different workflow management systems (WMS): a concrete, non-trivial workflow is implemented and executed using different WMS, which allows the different systems to be compared using well-chosen criteria. The workflow used for the comparison as well as the different implementations are publicly available, and one can only hope that more workflow management systems that are used in the scientific community (e.g. Apache Airflow, Elyra etc.) are added to this comparison in the future.

Metadata

Published: 2022-12-06
Last Updated: 2023-07-24
License: Creative Commons Attribution 4.0
Subjects: Data Management Software
Keywords: FAIR, reproducibility, scientific workflows, tool comparison, workflow management