Agile Research Data Management with Open Source: CaosDB

This is a Preprint and has not been peer reviewed. This is version 3 of this Preprint.

Authors

Daniel Hornung, Florian Spreckelsen, Thomas Weiß

Abstract

Research data management (RDM) in academic scientific environments is increasingly recognized
as an important part of good scientific practice and as a topic with great potential for saving
time and money. Nevertheless, there is a shortage of appropriate tools that fulfill the
specific requirements of scientific research. We identified where the requirements in science
deviate from those in other fields and proposed a list of features that RDM software should
provide to become a viable option.

Finally, we analyzed the open-source RDMS CaosDB for compatibility with the proposed features and
found that it fulfills the requirements.

Subjects

Data Management Software

Keywords

Data Management, Research Data Management, Agile Data Management, Software Tools, FAIR Data, Good Scientific Practice

Dates

Published: 2023-03-28 13:12

License

Creative Commons Attribution 4.0

Comments

Invited Review Comment #24 Torsten Bronger @ 2023-04-28 16:28

I fully agree with the premise of the article that data management tools must have a flexible data model and be of practical use. Moreover, I understand that CaosDB provides that data model and good functionality to search in it, which is an important use case. I am convinced that CaosDB fulfils its requirements and is good software per se.

Still, I don’t really understand which niche CaosDB is aiming at. I see a huge functional overlap with ELNs (electronic lab notebooks). In fact, I consider CaosDB an ELN (without judging its quality as such). However, the authors clearly distinguish between the terms ELN and RDMS. This may make sense, but I don’t see an explanation that sets RDMSs apart from ELNs.

Be that as it may, these days, an institute typically deploys an instance of an ELN and manages its data with that ELN. ELNs can provide custom fields per record, can automatically import experimental data via crawlers, and can search in the data (without SQL). This all works decently for most users, in my experience.

Again, CaosDB may well be a fine tool for RDM, but as a reader I have difficulties in understanding its specific “short-term advantages” (quote from the article). If it really is a different beast from an ELN, and some sort of umbrella for other ELNs and data sources, what is the benefit for the researcher?


Minor remarks

Line 189: Please add a note that this ELN integration is in a very early phase of development.

Figure 4: I think some readers would want to know why MariaDB was chosen for the backend rather than MongoDB. The latter is much closer to CaosDB’s memory model after all – just some restrictions would have to be imposed on the application level.

Lines 230–239 look odd to me.  They introduce a new issue, whereas the conclusions should wrap up the article.  I assume it is meant as an outlook, but then, it should be introduced as such.

Invited Review Comment #21 Marius Politze @ 2023-04-20 14:40

Content:

The article "Agile Research Data Management with Open Source: CaosDB" presents a requirements analysis process for "CaosDB", a research data management database system to link research data and store research metadata using mostly unstructured records with flexible properties. After a short introduction into the topic of RDM, the authors propose a schematic and exemplary research data life cycle and deduce their main requirements for their implemented software in section 2. To cover the requirements, the authors derive a set of features that their software should cover in section 3. Section 4 discusses architectural and structural decisions for CaosDB and very briefly points out some prominent features. The article closes with a brief conclusion and appendixes referencing the source code and giving a comparison of SPARQL and a CaosDB query.

Overall Evaluation:

The article is well written and understandable. However, it leaves me with the impression of being more a tool presentation than a scientific article. While this is not uncommon in the field of RDM, I am still missing at least a critical evaluation of the following points:

  • Requirements are derived from a scientific workflow, features are proposed, and some are implemented by CaosDB. It does not become clear how CaosDB actually fulfills the requirements.
  • Being Open Source is prominently mentioned in the title; however, it is not put into relation to the analyzed requirements: How does CaosDB being Open Source help its users? Especially:
    • How are contributions managed?
    • How big is the community?
    • Is there a core maintenance or an overarching roadmap / vision? How can a user influence that?
  • Where and how does it compete against relational databases, or document-based databases like MongoDB or OpenSearch, when used in practice, e.g., with billions of datasets? Where does it excel, where does it fall short?
  • FAIR principles are mentioned in the keywords and the paper on FAIR principles is referenced in the Introduction. The relationship of CaosDB's features to the FAIR principles remains largely unclear. Especially (but not limited to):
    • F1. (Meta)data are assigned a globally unique and persistent identifier
    • I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.
    • I2. (Meta)data use vocabularies that follow FAIR principles
  • How does CaosDB ensure long term interoperability, e.g., with respect to a research data commons infrastructure? How does it relate to existing standards, and how does it help to build interoperable data collections rather than the next data silo?

Relevance:

Simplification of queries and a certain agility in the data models are, in my experience, a non-negligible issue for users' acceptance of RDM metadata storage systems. CaosDB is not the first and certainly not the last system trying to cover this area. The article presents CaosDB as an example of research software engineering, where a product developed by a group of researchers eventually emerges as a tool for a bigger and cross-discipline community. The article also shows what I see as some of the major challenges: the adoption of standards and research data commons infrastructures for interoperability, and mid- to long-term availability as the tools grow. Technology-wise, other approaches like document-oriented databases (Elasticsearch / OpenSearch), data migrations for RDMS, or schema- or profile-based (linked data) metadata stores are widely available.

Presentation quality:

  • Overall presentation quality is decent. Figures are legible. The formatting is according to the template.
  • Personally I do not like the mix of URLs in footnotes, in the text and in the references. It should be checked with the editors if there is a policy.

Major Improvements:

  • Misleading Title: While Open Source tool chains are generally desirable for RDM, throughout the text it does not become clear where a user of CaosDB profits from it being Open Source. There are other mentioned features of CaosDB that seem to be much more prominently discussed in the article.
  • Section 2
    • The presented data workflow likely comes from personal experience. A reference to some empirical backing that other labs have comparable processes would be desirable.
  • Section 3
    • The section lists several proposed features, however it remains unclear how they relate to the previously derived requirements.
  • Section 4
    • Generally missing orientation towards standards, e.g., DCAT, PROV
    • Section 4.2 / 4.3: Missing discussion about evaluation and quality assurance of data. Based on the available demo, I assume this is done by RecordTypes
      • Again standards like SHACL, XML Schema or JSON Schema should be considered
      • Would also demonstrate interoperability
    • Section 4.2: Crawler is mentioned to gather information from other sources into CaosDB. To retain semantic information, this requires a mapping from Linked Data semantics. A bijective mapping layer like R2RML or RML would need to be introduced to foster interoperability when a data set is registered or published.
    • Section 4.3: Again a list of features without relation to the previous lists of features and requirements
  • Section 5
    • Missing a critical evaluation, see comments in Overall Evaluation.
  • Section 7: Appendix Query Language Comparison:
    • Omits the additional expressivity of the SPARQL query, e.g.
      • Multi-language labels and queries
      • wdt:P27 "citizenship" vs. wdt:P19 "place of birth" with clearly defined semantics in WikiData vs. "country" in CaosDB
    • It is arguable whether the added expressivity is always needed; however, this is more likely if research data is shared across disciplines.
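
The point about clearly defined semantics in the query language comparison can be illustrated with a minimal Python sketch. It is not taken from the article or from CaosDB: the toy triple list, the flat records, the `ex:` entities, and both helper functions are hypothetical and serve only as an illustration. Only the property IDs P27 (country of citizenship) and P19 (place of birth) are the Wikidata properties named above; for brevity, the place of birth is given at country level.

```python
# Toy data for illustration only; all "ex:" entities are hypothetical.
# wdt:P27 = country of citizenship, wdt:P19 = place of birth (Wikidata).
TRIPLES = [
    ("ex:alice", "wdt:P27", "ex:France"),  # alice is a citizen of France
    ("ex:alice", "wdt:P19", "ex:Poland"),  # ... but was born in Poland
    ("ex:bob", "wdt:P27", "ex:Poland"),    # bob is a citizen of Poland
    ("ex:bob", "wdt:P19", "ex:Poland"),    # ... and was born in Poland
]

# The same people stored with a single generic "country" field.
# Nothing in the record says whether citizenship or birthplace is meant.
FLAT_RECORDS = [
    {"name": "ex:alice", "country": "ex:France"},
    {"name": "ex:bob", "country": "ex:Poland"},
]


def subjects_with(predicate, obj):
    """Return all subjects that have the given predicate/object pair."""
    return sorted(s for s, p, o in TRIPLES if p == predicate and o == obj)


def records_with_country(country):
    """Return the names of all flat records whose 'country' field matches."""
    return sorted(r["name"] for r in FLAT_RECORDS if r["country"] == country)


if __name__ == "__main__":
    print("citizens of Poland:", subjects_with("wdt:P27", "ex:Poland"))
    print("born in Poland:    ", subjects_with("wdt:P19", "ex:Poland"))
    print("country == Poland: ", records_with_country("ex:Poland"))
```

With typed predicates, the two questions give different answers for the same data. The generic "country" field can answer only one of them, and a reader from another discipline cannot tell which one without additional documentation.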

Minor Improvements:

  • Line 47: likely missing Oxford comma
  • Line 64: likely missing Oxford comma
  • Line 124: missing comma after "e.g."
  • Line 130: missing comma after "e.g."
  • Line 151: better "cloud storage systems" to avoid non-existing plural form of uncountable noun and confusion with "stores" ("shops")
  • Line 155, Caption of Figure 2: "meta data" should be "metadata"
  • Line 200, line 201: "templated queries" better be "query templates"
  • Line 245, line 248: Indentation wrong
  • Line 296: Author name spelling/encoding error
  • Line 301ff: DOI based references could omit the visited date
  • Line 379: Missing a visited date