Agile Research Data Management with Open Source: LinkAhead

This is a Preprint and has not been peer reviewed. A published version of this Preprint is available on ing.grid. This is version 8 of this Preprint.

Authors

Daniel Hornung, Florian Spreckelsen, Thomas Weiß

Abstract

Research data management (RDM) in academic scientific environments increasingly enters the focus as an important part of good scientific practice and as a topic with big potentials for saving time and money. Nevertheless, there is a shortage of appropriate tools, which fulfill the specific requirements in scientific research. We identified where the requirements in science deviate from other fields and proposed a list of requirements which RDM software should answer to become a viable option. We analyzed a number of currently available technologies and tool categories for matching these requirements and identified areas where no tools can satisfy researchers' needs. Finally we assessed the open-source RDMS (research data management system) LinkAhead for compatibility with the proposed features and found that it fulfills the requirements in the area of semantic, flexible data handling in which other tools show weaknesses.

Comments

Comment #72 Christian Stemmer @ 2023-11-27 11:32

The reviewers have accepted the current version of the manuscript and we are happy to go forward with the publication of the article.
Thanks to the reviewers, and to the authors for taking up the suggestions put forth.

Invited Review Comment #71 Torsten Bronger @ 2023-11-23 12:14

My concerns have been addressed by the authors. In my opinion this article is ready as regards its form and content.


One minor thing: There is a spelling mistake at „Scool“.

Comment #59 Daniel Hornung @ 2023-10-09 10:02

# Answer to Review 52 #

We would like to thank the reviewer Torsten Bronger for the extended comments and we hope to address the remaining questions to the fullest satisfaction.

## Preface: Name change CaosDB -> LinkAhead ##

In the time between the previous revision and this one, "CaosDB" was renamed to "LinkAhead". This name change reflects the broader audience ("CaosDB" was originally developed in the area of nonlinear dynamics, but generally the tongue-in-cheek name caused more confusion than laughs for people who did not have their roots in chaos theory). Therefore the name has been updated throughout the article. We do not expect any semantic consequences from this name change.

## Distinction between LinkAhead and ELNs ##

The main question is about the distinction between LinkAhead and Electronic Lab Notebooks. For the purpose of this article, we employ a narrower definition of ELNs, which we now also incorporate into the text (4.1). In the current sense of the article, solutions like LinkAhead, Nomad, and JuliaBase are not ELNs because they do not serve as a replacement for paper notebooks in the lab. We now write, however, that some products which do not fall under the narrow ELN definition may have ELN modules (e.g. Nomad or Kadi4Mat), and that LinkAhead could be used as the base to develop an ELN by adding a more streamlined user interface for a better user experience in the laboratory.

## Feature sets ##

We expanded the table in 4.2.1 according to the new information given by the reviewer. Additionally there is now a note explaining that due to the large number of existing software solutions, it is difficult to assign unambiguous values. Therefore, half-filled table cells may denote that some products may fulfill the criteria. As a side note, we would be interested in the future in learning more about JuliaBase, which we did not know before.

We still hold however that LinkAhead completely fulfills R1 and R2, in that the data model firmly connects semantic meaning to data and links between data, and that the data model may be changed at any time without severing those semantic links.
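
As a purely illustrative sketch of what "changing the data model without severing semantic links" can mean, consider the following toy model. The classes `RecordType` and `Record` and the method `missing` are hypothetical names chosen for this example; they are not the LinkAhead API and make no claim about how LinkAhead is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class RecordType:
    """Hypothetical record type: a name plus the property names it expects."""
    name: str
    properties: set[str] = field(default_factory=set)


@dataclass
class Record:
    """Hypothetical record: plain values and named links to other records."""
    record_type: RecordType
    values: dict[str, object] = field(default_factory=dict)
    links: dict[str, "Record"] = field(default_factory=dict)

    def missing(self) -> set[str]:
        """Properties the type expects now but this record does not carry."""
        return self.record_type.properties - set(self.values) - set(self.links)


experiment = RecordType("Experiment", {"date"})
sample = RecordType("Sample", {"material"})

s1 = Record(sample, {"material": "steel"})
e1 = Record(experiment, {"date": "2023-10-09"}, links={"sample": s1})

# Later the data model evolves: experiments should also record a temperature.
experiment.properties.add("temperature")

# The existing record and its link to the sample are untouched; the record
# is merely reported as incomplete with respect to the new model.
print(e1.links["sample"].values["material"])
print(e1.missing())
```

In this toy model the link from the experiment to its sample carries a name ("sample") and survives the later extension of the `Experiment` type, which is the property claimed for R1 and R2 above.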

We defer the decision about the table numbering to the editor; "Table 1" was LaTeX's default.

## Summary ##

We believe we have addressed the questions raised by the reviewer by inserting a more rigorous definition of the term "Electronic Lab Notebook" for the sake of this article. We also incorporated additional information into the table and solution comparison.

Additionally, we replaced the name "CaosDB" with "LinkAhead" throughout the article and added a short section explaining the name change.

Comment #53 Christian Stemmer @ 2023-09-15 16:48

Thanks for the second round of reviews. The authors have been asked to answer comment #52.

Invited Review Comment #52 Torsten Bronger @ 2023-09-15 16:07

The new section 4 in the manuscript is very helpful, although it compares things of very different kinds.

In this iteration of the manuscript I want to focus my comments on the issue which was most important to me, namely the distinction between ELNs and CaosDB.

The distinction between ELNs and RDMSes does not become clearer to me. I even believe that it does not make sense. “ELN” has become a general term for all kinds of software that accompanies data management in the research process. In this sense, CaosDB is one incarnation of such software. (Probably a good one.)

I have serious trouble with the ELN line in table 4.2.1 on page 8. (It is called only table 1 in the caption.) ELNs cover all requirements if you use a good product. Well, R1 and R2 somewhat exclude each other, but this is a compromise each solution must make, with CaosDB being no exception. The white circles at R2, R6, and R10 are incorrect, as I know of ELNs offering that (even in the same product). For instance, JuliaBase (full disclosure: this is my baby) has R2 through its “result processes”, R6 through its “crawlers” and the “remote client”, and R10 due to its links to the data files in the file system.  And I know that Kadi4Mat covers everything except for R6, which I am simply not really sure about.

Thus, I still recommend to avoid setting CaosDB apart from ELNs, and instead focus on its feature set and how it helps researchers. It is probably a good solution, but not a separate kind of solution. And table 4.2.1 needs to be corrected.

Invited Review Comment #50 Marius Politze @ 2023-09-08 11:46

Thanks to the authors for providing an updated version of their article. My concerns have been fully addressed.

Comment #49 Marius Politze @ 2023-09-07 10:29

Thanks to the authors for providing an updated version of their article. My concerns have been fully addressed.

Comment #44 Christian Stemmer @ 2023-08-08 13:48

We thank the authors for the revision of the paper, which has been passed to the reviewers to check whether the revision meets the concerns raised.

Comment #34 Daniel Hornung @ 2023-07-13 17:52

# Answer to Review #21 #

We would like to thank the reviewers for their time and valuable comments. Their critical view helped us to revise the article and we hope to have addressed all the remarks to their satisfaction.

Here we answer the points raised by Marius Politze on 2023-04-20:

## Overview ##

The reviewer noted that the overall structure of the article was unclear about the main message of the authors. We restructured the manuscript to give the main topics more weight: requirements for scientific research data management, how current technologies and tools compare in the light of these requirements, and whether CaosDB is a viable alternative to other approaches.

## List of remarks ##

- It does not get clear how CaosDB actually fulfills the requirements?
  - We removed the confusing distinction between challenges and features and instead introduced a list of requirements with a consistent numbering scheme. These numbers are now used throughout the article to refer to specific requirements.
- How does CaosDB being Open Source help its users? (Contributions, community, roadmap)
  - To us, the most important aspect of Open Source software is its impact on long-term sustainability, which we added as (R4).
  - We added information about contribution workflows, the user and developer community, and feature plans to the article.
- Where and how does it compete against relational databases, or document based databases?
  - We added a technical comparison of CaosDB and SQL/NoSQL databases to the "tools landscape" section.
  - We also added information about topics where CaosDB underperforms compared to plain SQL approaches and which remedies may be available for these cases.
- Relationship between CaosDB and the FAIR principles remains largely unclear.
  - We made it clear that existing tools and CaosDB alike can *enable* FAIR data management, but they cannot enforce it.
  - We added two paragraphs in "5.3 Critical evaluation and outlook" about how CaosDB relates to ontology management and FAIR data. Specifically, we emphasize that CaosDB can be used as a tool to implement FAIR data management, but that it requires users to customize it to their (field-)specific needs and to use it accordingly.
- Long term interoperability, relation to existing standards, interoperable data collections rather than the next data silo?
  - We noted that CaosDB uses standardized formats for data exchange and that further standardization, for example export to ro-crate compatible formats, is planned.
  - We did not elaborate too much on this topic and the specifics of FAIR data management to keep the article short, but we are open to adding another section on this if the reviewer expects this aspect to be of relevance to the broader audience.
- CaosDB is not the first and certainly not the last system trying to cover this area. The article presents CaosDB as an example of research software engineering, [...] Technology wise other approaches like document oriented databases (elasticsearch / opensearch), data migrations for RDMS or schema or profile based (linked data) metadata stores are widely available.
  - We added emphasis on where CaosDB is fundamentally different from other approaches and put structure to this question by systematically comparing other technologies to the requirements list, as shown in the table. We agree that the point of structured-yet-flexible data management is difficult to convey and thank the reviewer for pointing this out. We hope that the current form of the article helps to better guide the readers.
- Personally I do not like the mix of URLs in footnotes, in the text and in the references. It should be checked with the editors if there is a policy.
  - We could not find a policy but agree that the mix was confusing. We changed the document so that URLs of immediate practical usefulness to the readers are in the text; all others are references now.
- Misleading Title: Open Source [...]
  - We believe we have addressed this comment above.
- Section 2: workflow from personal experience
  - We added references here and throughout the manuscript where we felt that they may be useful to the readers.
- Section 3: proposed features vs. previously derived requirements
  - We agree that this was unclear. The new manuscript structure should have overcome this issue.
- Section 4: Generally missing orientation towards standards e.g., DCAT, Prov
  - We intend the first paragraph of "5.3 Critical evaluation ..." to clarify that CaosDB is a well-suited framework to implement semantic web / ontology standards such as those by W3C, but that it does not come with pre-defined structures (yet) which already contain the structures intended by said standards.
- Section 4.2 / 4.3: Missing discussion about evaluation and quality assurance of data[...]
  - We added a few notes to the appendix (7.2) which say that some CaosDB libraries have provisions to provide certain guarantees.
  - We added a paragraph in 5.1.1 which explains the developers' design choice to favor flexibility over strict consistency in those cases where the two conflict.
- Section 4.2: [...] bijective mapping layer like R2RML or RML would need to be introduced to foster interoperability [...]
  - We agree that these mapping languages certainly have some use for publishing data to targets which use RDF. They may be used for simple cases of inputs where no complex transformation takes place. For these cases, however, there exists a YAML specification for the CaosDB crawler, which we now mention in 5.1.2.
- Section 4.3: [...] list of features without relation to the previous [...] requirements.
  - We added a more concise and user-friendly list with the relations in 5.2.
- Section 5: Missing a critical evaluation
  - This is now in section 5.3.
- Section 7: Appendix Query Language Comparison
  - We added a very short paragraph which outlines how SPARQL and CaosDB's query language differ. Again, if the reviewer expects the readership to be interested in this, we can elaborate more on this topic.
  - We corrected the imprecise "country" / "citizenship" wording to use the same semantics in both examples.
- We fixed the typos and other minor issues.
- The indentation in the CaosDB sub projects was intentional, "Advanced Python Tool" and "Crawler" are both Python libraries which are built upon the Python library.

We hope that the article is clearer now and look forward to the next round of reviews.

Invited Review Comment #24 Torsten Bronger @ 2023-04-28 16:28

I fully agree with the premise of the article that data management tools must have a flexible data model and practical use.  Moreover, I understand that CaosDB provides that data model and good functionality to search in it, which is an important use case. I am convinced that CaosDB fulfils its requirements and is good software per se.

Still, I don’t really understand which niche CaosDB is aiming at.  I see a huge functional overlap with ELNs (electronic lab notebook).  In fact, I consider CaosDB an ELN (without judging its quality as such).  However, the authors clearly distinguish between the terms ELN and RDMS.  This may make sense, but I don’t see an explanation that sets RDMS’es apart from ELNs.

Be that as it may, these days, an institute typically deploys an instance of an ELN and manages its data with that ELN.  ELNs can provide custom fields per record, can automatically import experimental data via crawlers, and can search in the data (without SQL).  This all works decently for most users, in my experience.

Again, CaosDB may well be a fine tool for RDM, but as a reader I have difficulties in understanding its specific “short-term advantages” (quote from the article).  If it really is another beast than an ELN, and some sort of umbrella for other ELNs and data sources – what is the benefit for the researcher?


Minor remarks

Line 189: Please add a note that this ELN integration is in a very early phase of development.

Figure 4: I think some readers would want to know why MariaDB was chosen for the backend rather than MongoDB. The latter is much closer to CaosDB’s memory model after all – just some restrictions would have to be imposed on the application level.

Lines 230–239 look odd to me.  They introduce a new issue, whereas the conclusions should wrap up the article.  I assume it is meant as an outlook, but then, it should be introduced as such.

Invited Review Comment #21 Marius Politze @ 2023-04-20 14:40

Content:

The article "Agile Research Data Management with Open Source: CaosDB" presents a requirements analysis process for "CaosDB", a research data management database system to link research data and store research metadata using mostly unstructured records with flexible properties. After a short introduction into the topic of RDM, the authors propose a schematic and exemplary research data life cycle and deduce their main requirements for their implemented software in section 2. To cover the requirements, the authors derive a set of features that their software should cover in section 3. Section 4 discusses architectural and structural decisions for CaosDB and very briefly points out some prominent features. The article closes with a brief conclusion and appendixes referencing the source code and giving a comparison of SPARQL and a CaosDB query.

Overall Evaluation:

The article is well written and understandable. However, it leaves me with the impression of being more a tool presentation than a scientific article. While this is not uncommon in the field of RDM, I am still missing at least a critical evaluation with respect to the following:

  • Requirements are derived from a scientific workflow, features are proposed, and some are implemented by CaosDB. It does not get clear how CaosDB actually fulfills the requirements?
  • Being Open Source is prominently mentioned in the title, however it is not put into relation to the analyzed requirements: How does CaosDB being Open Source help its users? Especially
    • How are contributions managed?
    • How big is the community?
    • Is there a core maintenance or an overarching roadmap / vision? How can a user influence that?
  • Where and how does it compete against relational databases, or document based databases like mongoDB or OpenSearch when used in practice, e.g., with billions of datasets. Where does it excel, where does it fall short?
  • FAIR principles are mentioned in the keywords and the paper on FAIR principles is referenced in the Introduction. The relationship of CaosDB's features to the FAIR principles remains largely unclear. Especially (but not limited to):
    • F1. (Meta)data are assigned a globally unique and persistent identifier
    • I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
    • I2. (Meta)data use vocabularies that follow FAIR principles
  • How does CaosDB ensure long term interoperability, e.g., with respect to a research data commons infrastructure? How does it relate to existing standards, and how does it help to build interoperable data collections rather than the next data silo?

Relevance:

Simplification of queries and a certain agility in the data models are, in my experience, a non-negligible issue for users' acceptance of RDM metadata storage systems. CaosDB is not the first and certainly not the last system trying to cover this area. The article presents CaosDB as an example of research software engineering, where a product developed by a group of researchers eventually emerges as a tool for a bigger and cross-discipline community. The article also shows what I see as some of the major challenges: the adoption of standards and research data common infrastructures for interoperability and mid to long term availability as the tools grow. Technology-wise, other approaches like document oriented databases (elasticsearch / opensearch), data migrations for RDMS or schema or profile based (linked data) metadata stores are widely available.

Presentation quality:

  • Overall presentation quality is decent. Figures are legible. The formatting is according to the template.
  • Personally I do not like the mix of URLs in footnotes, in the text and in the references. It should be checked with the editors if there is a policy.

Major Improvements:

  • Misleading Title: While Open Source tool chains are generally desirable for RDM, throughout the text it does not become clear where a user of CaosDB profits from it being Open Source. There are other mentioned features of CaosDB that seem to be much more prominently discussed in the article.
  • Section 2
    • The presented data workflow likely comes from personal experience; a reference to some empirical backing that other labs have comparable processes would be desirable.
  • Section 3
    • The section lists several proposed features, however it remains unclear how they relate to the previously derived requirements.
  • Section 4
    • Generally missing orientation towards standards e.g., DCAT, Prov
    • Section 4.2 / 4.3: Missing discussion about evaluation and quality assurance of data; based on the demo available, I assume this is done by RecordTypes
      • Again standards like SHACL, XML Schema or JSON Schema should be considered
      • Would also demonstrate interoperability
    • Section 4.2: Crawler is mentioned to gather information from other sources into CaosDB. To retain semantic information, this requires a mapping from Linked Data semantics. A bijective mapping layer like R2RML or RML would need to be introduced to foster interoperability when a data set is registered or published.
    • Section 4.3: Again a list of features without relation to the previous lists of features and requirements
  • Section 5
    • Missing a critical evaluation, see comments in Overall Evaluation.
  • Section 7: Appendix Query Language Comparison:
    • Omits the additional expressivity of the SPARQL query, e.g.
      • Multi-language labels and queries
      • wdt:P27 "citizenship" vs. wdt:P19 "place of birth" with clearly defined semantics in WikiData vs. "country" in CaosDB
    • It is arguable if the added expressivity is always needed, however this is more likely if research data is shared across disciplines.
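
To make the expressivity point concrete, here is a minimal sketch of a parameterized SPARQL query built in Python. It is not the query from the article's appendix. The helper name `people_query` is hypothetical, and the query assumes the prefixes `wd:`, `wdt:` and `rdfs:` are predefined, as they are on the public Wikidata query service. The property and item identifiers used are Wikidata's: P31 (instance of), Q5 (human), P27 (country of citizenship), P19 (place of birth), Q183 (Germany), Q64 (Berlin).

```python
def people_query(property_id: str, target_qid: str, lang: str = "en") -> str:
    """Build a SPARQL query for humans linked to `target_qid` via `property_id`.

    Hypothetical helper for illustration; it only assembles the query text
    and does not contact any endpoint.
    """
    return f"""
SELECT ?person ?label WHERE {{
  ?person wdt:P31 wd:Q5 ;
          wdt:{property_id} wd:{target_qid} ;
          rdfs:label ?label .
  FILTER(LANG(?label) = "{lang}")
}}
LIMIT 10
""".strip()


# Citizens of Germany, with German labels.
citizens_de = people_query("P27", "Q183", lang="de")

# People born in Berlin, with English labels: a different, separately
# defined relation than citizenship.
born_in_berlin = people_query("P19", "Q64", lang="en")

print(citizens_de)
```

Swapping `P27` for `P19` changes the meaning of the result set while the query shape stays the same, and the `lang` parameter selects the label language; a single free-text "country" property cannot distinguish these cases.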

Minor Improvements:

  • Line 47: likely missing Oxford comma
  • Line 64: likely missing Oxford comma
  • Line 124: missing comma after "e.g."
  • Line 130: missing comma after "e.g."
  • Line 151: better "cloud storage systems" to avoid non-existing plural form of uncountable noun and confusion with "stores" ("shops")
  • Line 155, Caption of Figure 2: "meta data" should be "metadata"
  • Line 200, line 201: "templated queries" better be "query templates"
  • Line 245, line 248: Indentation wrong
  • Line 296: Author name spelling/encoding error
  • Line 301ff: DOI based references could omit the visited date
  • Line 379: Missing a visited date


Metadata
  • Published: 2023-03-28
  • Last Updated: 2024-02-05
  • License: Creative Commons Attribution 4.0
  • Subjects: Data Management Software
  • Keywords: Data Management, Research Data Management, Agile Data Management, Software Tools, FAIR Data, Good Scientific Practice