PIA - A Concept for a Personal Information Assistant for Data Analysis and Machine Learning of Time-Continuous Data in Industrial Applications

This is a Preprint and has not been peer reviewed. A published version of this Preprint is available on ing.grid. This is version 5 of this Preprint.

Authors

Christopher Schnur, Tanja Dorst, Kapil Sajjan Deshmukh, Sarah Zimmer, Philipp Litzenburger, Tizian Schneider, Lennard Margies, Rainer Müller, Andreas Schütze

Abstract

A database with high-quality data must be given to fully use the potential of Artificial Intelligence (AI). Especially in small and medium-sized companies with little experience with AI, the underlying database quality is often insufficient. This results in an increased manual effort to process the data before using AI. In this contribution, the authors developed a concept to enable inexperienced users to perform a first data analysis project with machine learning and record data with high quality. The concept comprises three modules: accessibility of (meta)data and knowledge, measurement and data planning, and data analysis. Furthermore, the concept was implemented as a front-end demonstrator on the example of an assembly station and published on the GitHub platform for potential users to test and review the concept.

Comments

Comment #55 Robert H. Schmitt @ 2023-09-18 09:20

As the responsible topical editor, I would like to thank the authors for answering the reviewers and revising their submission. My thanks to the reviewers as well.

The revised paper is hereby accepted.

I would like to ask the authors to upload their final version (including any final language improvements) and their compressed LaTeX project folder in order to be processed for publication.

Invited Review Comment #54 Anas Abdelrazeq @ 2023-09-18 03:27

I would like to thank the authors for revising the manuscript. It seems like they took my comments into consideration and responded to them.

One more recommendation would be a final language check before the publication.

  • In the title for example: time-continuous could be capitalized. Also, "Application" can be "Applications"

Invited Review Comment #48 Marco Kemmerling @ 2023-09-04 06:34

The authors have addressed all major and minor comments to my satisfaction.

I am hereby recommending the submission for publication.

Comment #46 Christopher Schnur @ 2023-08-25 06:21

Reviewer Reply:

Dear Reviewers, thank you for your valuable feedback. The authors are convinced that your feedback significantly improved the quality of the publication. In the following, the major and minor points of the reviewers have been broken down and addressed. Each point contains a reply.

___________________________
Reply to Reviewer 1
(Invited Review Comment #38 Marco Kemmerling @ 2023-07-27 10:33):

Major improvements:
1. First, the state of the art should be clearly described in the introduction or a separate section. Have similar tools been developed? If yes, what is the novelty of the presented approach and what gap does it fill?
-Reply: To the knowledge of the authors, there is no comparable existing tool per se, because data analysis, or rather assistance in analyzing data, is often sold as a service. However, multiple comparable tools exist for the individual modules M1-M3. Therefore, the authors added related work and the state of the art for each of the three modules individually. Furthermore, the authors extended the introduction to highlight the novelties of the presented approach and to show what gap the framework PIA fills.

2. The further structure of the article should be improved so that contents appear in suitable sections. For instance, Subsection 2.2 "Use Case: Assembly line" starts very abruptly and does not fit into the parent section called "Theoretical background and Methodology". The transition to Subsection 2.3 is also very sudden. The majority of the contents in the section called "Results" are not actually results, but details about the practical implementation. The title of the section should reflect its contents. Perhaps the following overall structure would be more suitable: Introduction, state of the art, theoretical background and methodology, implementation, exemplary usage on an assembly line use case, conclusion.
-Reply: The authors changed the overall structure, following the reviewer’s suggestion. The new sections are: Introduction, Theoretical Background and Methodology, Implementation and Results, Conclusion and Outlook. Subsection 2.2 “Use Case: Assembly line” has been moved to 4.1 in the chapter Implementation and Results.

3. The scope of the presented application is unclear to me. Are the algorithms in Figure 4 just an example / a subset of the capabilities of the toolbox or the whole available functionality? The figure suggests the toolbox is only suitable for classification, but later on, Figure 9 seems to show forecasting results.
-Reply: The authors extended the description of the toolbox to clearly highlight why it is presented in the paper and what its focus is. The standard algorithms of the toolbox are the five feature extraction and three feature selection methods, followed by a classification method. The resulting 15 combinations can be tested automatically with only five lines of code. However, the toolbox contains more algorithms and is also capable of regression tasks when used in manual mode. In principle, users can cherry-pick from all available methods in the toolbox and define a new standard that better suits their demands. Figure 9 shows the prediction of a classification task. The way MATLAB plots data points (by connecting them with interpolated lines) may have led to the confusion.

4. The scope and limits of the presented approach should be clearly stated. Is it really an assistant for data analysis and machine learning in general? Or is it only suitable for classification on time series data?
-Reply: The concept itself is an assistant for data analysis and machine learning in general. The authors extended their description of module 3, to point out that the users are free in the choice of their algorithms, but the authors focus within their contribution on time-continuous data.

5. I would further expect the title of the submission to reflect the capabilities of the proposed approach, as having a more specific title will help the target audience find the submission.
-Reply: The authors changed the title of their contribution to “PIA - A Concept for a Personal Information Assistant for Data Analysis and Machine Learning of time-continuous Data in Industrial Application” to make it more specific and highlight the aspect of the primary usage of algorithms for time-continuous data (in addition to the reviewer’s remark in 4.)

6. Some assertions could benefit from more details and some choices could be better motivated. For instance, in Subsection 2.5, it is said that the ML toolbox of Dorst et al. and Schneider et al. was used, but this choice is not motivated in any way. Some reasoning would help the reader understand why this choice was made.
-Reply: The authors extended the description of the toolbox and motivated their choice of using the toolbox. Furthermore, they added further literature on the performance of the toolbox on different datasets.

7. In the conclusion, it is claimed that adding a back-end can unfold the full potential of the concept. Why is this? It would be better to spell it out, so the reader does not have to guess.
-Reply: The sentence has been reformulated and an explanation on why adding a back-end would be helpful has been added: “By adding an additional back end, the progress of a project, attached files, comments etc., can be saved and loaded properly. Furthermore, a direct connection to a data base is conceivable.”

8. Similarly for the phrase "the authors will further develop their concept". It would be interesting to know how the authors will further develop the concept.
-Reply: The authors added the following sentence to describe the further development of their concept: Current development focuses on the integration of ontologies, respectively machine-readable metadata, in module 1, the generalization of the checklist in module 2 and improvements regarding the usability for inexperienced users in SMEs and, in module 3, the integration of a data pipeline to evaluate the data quality as well as the usage of algorithms that consider measurement uncertainty.

9. It should be clarified whether the submission is intended to be a research article or a software descriptor. The contents point to the latter, but it says research article in the submitted file. If the submission is intended as a software descriptor, instructions on how to contribute to the project should be included in the repository. If it is intended as a research article, I would expect a thorough evaluation of results and much fewer implementation details.
-Reply: The contribution is a research article with a focus on applied research. It describes a concept that can be used, modified, or further developed by users. The authors agree that the evaluation is a very important part of research and are therefore working on a second use case. However, the exchange within scientific communities is a very important part of research as well. For this contribution, the authors decided to use the opportunities that the journal ing.grid offers, especially the interaction through comments, to get feedback at an early stage. Furthermore, the authors believe that the implementation details shown match the journal's goals regarding transparency and re-usability.

Minor improvements:
1. The abstract should end with a short outlook or implication of the presented results.
-Reply: The authors changed the last sentence to “Furthermore, the concept was implemented as a front-end demonstrator on the example of an assembly station and published on the platform GitHub for potential users to test and review the concept.”

2. The term high-quality database seems unnecessarily ambiguous. Does it describe the database itself (schema design etc.) or its contents?
-Reply: The sentence has been reformulated into “A database with high-quality data must be given to fully use the potential of Artificial Intelligence (AI)” to remove the ambiguity.

3. The capitalization of (sub)section headings as well as the title of the submission is not consistent.
-Reply: Changed into consistent capitalization of headings.

4. Line 5: There is a word missing. Maybe: "[...] committees exclusively concern themselves with [...]"
-Reply: Changed into “…entire research fields and committees exclusively focus on improving data and their quality…”

5. Line 32: An English version.
-Reply: Changed.

6. Line 45: The meaning "high quality / excellent" of the word "solid" is typically only intended in colloquial contexts
-Reply: Changed into “…AI in the industrial context is a database with high-quality data”.

7. Line 124: It is claimed that a 10-fold cross-validation automatically determines the best combination. This should be explained in more detail. I assume that the following is happening: A model is built for each of the resulting 15 combinations and the performance of each model on unseen data is evaluated by 10-fold cross-validation. The best performing model is then automatically selected. This should be written out, so that the reader does not have to assume what is happening.
-Reply: The reviewer is right. The authors changed the description into “A 10-fold cross-validation automatically determines the best of the resulting 15 combinations, by ranking the combinations according to their resulting cross-validation error on the test-data”

8. Line 221: A sentence ends with "Figure 9", but this does not fit into the structure of the sentence. It is also generally unclear to me why the word figure is sometimes written out and sometimes abbreviated.
-Reply: Changed to “… (Figure 10).” (Note: The number of the figure increased due to another reviewer comment). Within this work, the “cleveref” package is used for referring to figures, tables etc. The default settings of the package distinguish between \cref (lowercase and abbreviated) and \Cref (capitalized and written out). To be consistent, the authors decided to always use \Cref.

9. Caption of Figure 9: "connect" should be "connected"
-Reply: Changed
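
Illustration for minor point 7: the selection procedure described there can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not the implementation of the MATLAB toolbox. All function names, the five toy feature extractors, the three toy feature selectors, the nearest-centroid classifier and the synthetic signals are hypothetical placeholders chosen to reproduce the 5 x 3 = 15 combinations and the ranking by 10-fold cross-validation error.

```python
import random
from itertools import product
from statistics import mean, pstdev, pvariance

def k_fold_indices(n, k):
    """Yield (train, test) index lists; sample i belongs to fold i % k."""
    for fold in range(k):
        yield ([i for i in range(n) if i % k != fold],
               [i for i in range(n) if i % k == fold])

def nearest_centroid(features, labels):
    """Toy classifier (assumption): predict the class with the closest centroid."""
    centroids = {}
    for label in set(labels):
        rows = [f for f, y in zip(features, labels) if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    def predict(f):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(f, centroids[c])))
    return predict

def cv_error(extract, select, signals, labels, k=10):
    """Mean misclassification rate on the held-out folds."""
    fold_errors = []
    for train, test in k_fold_indices(len(signals), k):
        f_train = [extract(signals[i]) for i in train]
        y_train = [labels[i] for i in train]
        keep = select(f_train, y_train)  # selection sees training data only
        clf = nearest_centroid([[f[j] for j in keep] for f in f_train], y_train)
        wrong = 0
        for i in test:
            f = extract(signals[i])
            wrong += clf([f[j] for j in keep]) != labels[i]
        fold_errors.append(wrong / len(test))
    return mean(fold_errors)

# Hypothetical stand-ins for five extraction and three selection methods.
extractors = {
    "mean": lambda x: [mean(x)],
    "min_max": lambda x: [min(x), max(x)],
    "mean_std": lambda x: [mean(x), pstdev(x)],
    "first_last": lambda x: [x[0], x[-1]],
    "halves": lambda x: [mean(x[:len(x) // 2]), mean(x[len(x) // 2:])],
}
selectors = {
    "all": lambda F, y: list(range(len(F[0]))),
    "first": lambda F, y: [0],
    "top_variance": lambda F, y: [max(
        range(len(F[0])), key=lambda j: pvariance([f[j] for f in F]))],
}

# Synthetic two-class signals: class 0 around level 0, class 1 around level 1.
rng = random.Random(0)
labels = [(i // 10) % 2 for i in range(40)]
signals = [[y + rng.gauss(0, 0.1) for _ in range(20)] for y in labels]

# Evaluate all 15 combinations and rank them by cross-validation error.
ranking = sorted(
    (cv_error(e, s, signals, labels), e_name, s_name)
    for (e_name, e), (s_name, s) in product(extractors.items(), selectors.items())
)
best_error, best_extractor, best_selector = ranking[0]
print(best_extractor, best_selector, best_error)
```

The feature selection is fitted inside each fold on the training part only, so the held-out fold stays unseen, which matches the reviewer's reading of the procedure.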
___________________________
Reply to Reviewer 2
(Invited Review Comment #39 Anas Abdelrazeq @ 2023-07-27 12:09)

Major Points
1. The story line of the paper is hard to follow.
-Reply: The general structure of the paper has been updated, and multiple sentences have been added that guide the reader through the paper and provide orientation.

2. The authors start with the pain point of inexperienced engineers about data analytics proposing a solution concept where it is not clear how these pain points are met or eased.
-Reply: The authors extended the conclusion and highlighted the benefit of the concept to the reader. Most of the mentioned pain points, e.g., lack of synchronization, appear at an advanced stage of the data analysis but are fairly easy to tackle or avoid at an early stage. Since the underlying idea behind the concept is the prevention of these pain points, a direct comparison between using the concept and not using the concept is not possible.

3. Several sections in the paper popped up with no clear introduction and conclusion within them.
-Reply: The authors added introductions (especially in chapter 2) and conclusions (especially in chapter 3) to increase the readability and understandability of the presented content.

4. Related work and current state of the art in the domain of assisting data analysts with “personal” tool are not tackled. How does the proposed tool stand between existing ones? How does it help in comparison to existing solutions?
-Reply: To the knowledge of the authors, there is no comparable existing tool per se, because data analysis, or rather assistance in analyzing data, is often sold as a service. However, multiple comparable tools exist for the individual modules M1-M3. Therefore, the authors added related work and the state of the art for each of the three modules individually.

5. The main contribution is not clearly stated. Is it the fact that several assisting modules are connected/combined in a special/new way? Or is it bringing these modules into an interface that assists with the use of them?
-Reply: The authors added a paragraph to clearly highlight the main contribution. The reviewer is right with his assumption: PIA combines assisting modules in a complementary manner and brings them into an interface that assists users in performing a data analysis.

6. The concept of the framework is based on 3 existing solutions or concepts M1-M3. Here it is not clear how the three were chosen and how combining them together brings a new value. Are the three modules integrated in a sequential form to each other? Or do they serve as separate assisting points? The connection between them should be made clear.
-Reply: The authors added, in the course of the literature review and state of the art, how each module was chosen. The resulting added value for the users was added in chapter 3 after each module. Regarding the connection of the modules, a schematic on the right-hand side of Figure 1 clarifies which module is involved in which step of a machine learning project.

7. In case the core contribution lies with the interface, then: How do design elements make it easier for the analyst to carry out their tasks and how are they evaluated?
-Reply: The interface is not the core contribution and therefore not the focus of this paper. For demonstration purposes, it is shown what an interface could look like and therefore the element design etc. is out of the scope of this paper. However, the design and user-friendliness of the presented use-case can be improved in the end use of the application.

8. In the introduction, M1 module aim was claimed to link data and knowledge for further insights. How is that achieved? Later, it is described that M1 aims eventually to extract lessons learned and then have them saved in a registry. The goal of this step is ambiguous from my point of view.
-Reply: The authors edited the description of M1 in the introduction to remove the ambiguity.

9. Also, is M1 connected with M2 in any of the 6 data analysis steps? Do the lessons learned feed into the check list in any form?
-Reply: The authors extended the description of the modules and highlighted the connection between M1 and M2 and how the lessons learned feed into the checklist. The lessons learned feed into the checklist as tip and hint boxes.

10. The assembly use case was presented before the concept design. Does it serve as a requirement base for the design? Or for validation purposes? Authors should elaborate on that.
-Reply: The authors moved the assembly use case into chapter 3 (Implementation and Results) and highlighted that the use-case is for validating the concept.

11. How were the requirements of the design extracted from the use case? To end up with the 3 modules and their interface?
-Reply: The authors extended their description of how the modules were chosen and which user needs they address. Furthermore, they moved the use case to chapter 3 to clarify that the concept was created before the use case and that the framework is applied to the use case in order to validate it.

12. In case the assembly line was brought in as a validation use case: How specifically can this use case play a role in the generalization of the approach for different and further use cases?
-Reply: The combination of two different stations with different processes and different degrees of automation as well as the opportunity to produce a second variant of the device holder make this assembly line a good use case for demonstrating the flexibility of PIA while keeping the complexity low (compared to more extensive assembly lines).

13. FAIR data principles were mentioned as a base of the framework. How were these principles practically assured via the concept design?
-Reply: The GUI and structure of PIA make data, metadata and domain-specific knowledge findable and accessible. M2, the checklist, recommends multiple aspects of the FAIR principles that ensure that new data can be recorded in a FAIR manner. As the checklist has a dedicated publication, the authors refer to [3] for more details.
[3] C. Schnur, S. Klein, A. Blum, A. Schütze, and T. Schneider, “Steigerung der Datenqualität in der Montage,” wt Werkstattstechnik online, vol. 112, pp. 783–787, Dec. 2022. DOI: 10.37544/1436-4980-2022-11-12-57.

Other Improvements
1. Consistency in the paper title letters case – suggested correction “… Industrial Application(s)”
-Reply: Changed into capital letters

2. Line 3: Letter case “Artificial Intelligence”. If letters were capitalized to introduce an abbreviation, it should be consistent all over the paper.
-Reply: Consistently capitalized all over the paper

3. Line 7-8: Reformulate the sentence.
-Reply: Reformulated into “In case of typical condition monitoring tasks, for example, an early detection of damages or wear down of machine parts can avoid unplanned machine downtime costs.”

4. Line 32: “An English”
-Reply: Changed.

5. Introducing the structure of the paper at the end of the introduction can be added.
-Reply: The structure of the paper has been added at the end of the introduction.

6. Consistency with referencing. Sometimes the Authors refer to the reference with the number and sometimes with names.
-Reply: To increase the consistency, the authors updated several citations and now also give the reference number wherever they cite by author name, e.g., Wilhelm et al. [9].

7. What is the reason for choosing virtual machines over newer approaches such as Docker stations?
-Reply: Here, the authors decided to use virtual machines (VMs) due to existing boundary conditions and resources. However, Docker stations are a possible solution as well. The focus of this contribution is the concept.

8. Line 159: Figure 10 was mentioned, it can be inserted around there as well. No need to bring it up in appendix. Other figures were shown as part of the main text as well.
-Reply: Figure 10 was removed from the appendix and inserted into the main text.

Comment #43 Robert H. Schmitt @ 2023-08-07 10:31

As the responsible topical editor, I would like to thank the reviewers for their constructive feedback.
After consideration of the comments, I advise the authors to revise the publication considering the reviews and their suggestions. After the rework, a new version of the publication including the revisions may be submitted by the authors for further consideration.

Invited Review Comment #39 Anas Abdelrazeq @ 2023-07-27 05:09

The paper “PIA - A Concept for a Personal Information Assistant for Data Analysis and Machine Learning in industrial application” tackles the difficulties that new and inexperienced engineers face with data analytics steps, motivated by the fact that data and AI-related analysis is crucial for today's small and medium-sized companies.

The general outcome of the paper is the concept of the assisting framework, PIA. That is demonstrated via a user interface UI (as a mobile and desktop application and based on the FAIR data principles) that assists inexperienced users to perform their “first” data analytics activities.

Major Points

The story line of the paper is hard to follow. The authors start with the pain point of inexperienced engineers about data analytics proposing a solution concept where it is not clear how these pain points are met or eased. Several sections in the paper popped up with no clear introduction and conclusion within them.

Related work and current state of the art in the domain of assisting data analysts with “personal” tool are not tackled. How does the proposed tool stand between existing ones? How does it help in comparison to existing solutions?

The main contribution is not clearly stated. Is it the fact that several assisting modules are connected/combined in a special/new way? Or is it bringing these modules into an interface that assists with the use of them?

  • The concept of the framework is based on 3 existing solutions or concepts M1-M3. Here it is not clear how the three were chosen and how combining them together brings a new value. Are the three modules integrated in a sequential form to each other? Or do they serve as separate assisting points? The connection between them should be made clear.
  • In case the core contribution lies with the interface, then: How do design elements make it easier for the analyst to carry out their tasks and how are they evaluated?

In the introduction, M1 module aim was claimed to link data and knowledge for further insights. How is that achieved? Later, it is described that M1 aims eventually to extract lessons learned and then have them saved in a registry. The goal of this step is ambiguous from my point of view. Also, is M1 connected with M2 in any of the 6 data analysis steps? Do the lessons learned feed into the check list in any form?

The assembly use case was presented before the concept design. Does it serve as a requirement base for the design? Or for validation purposes? Authors should elaborate on that.

  • How were the requirements of the design extracted from the use case? To end up with the 3 modules and their interface?
  • In case the assembly line was brought in as a validation use case: How specifically can this use case play a role in the generalization of the approach for different and further use cases?

FAIR data principles were mentioned as a base of the framework. How were these principles practically assured via the concept design?

Other Improvements

  • Consistency in the paper title letters case – suggested correction “… Industrial Application(s)”
  • Line 3: Letter case “Artificial Intelligence”. If letters were capitalized to introduce an abbreviation, it should be consistent all over the paper.
  • Line 7-8: Reformulate the sentence.
  • Line 32: “An English”
  • Introducing the structure of the paper at the end of the introduction can be added.
  • Consistency with referencing. Sometimes the Authors refer to the reference with the number and sometimes with names.
  • What is the reason for choosing virtual machines over newer approaches such as Docker stations?
  • Line 159: Figure 10 was mentioned, it can be inserted around there as well. No need to bring it up in appendix. Other figures were shown as part of the main text as well.

Invited Review Comment #38 Marco Kemmerling @ 2023-07-27 03:33

The submission introduces a concept and software package to guide inexperienced users through data collection and data analysis processes. To achieve this, three modules pertaining to knowledge management, data acquisition planning, and data analysis are combined within one tool. While such a tool can be of great use for the target audience, I suggest making some modifications to improve the quality of the submission before publication.

Major improvements:

First, the state of the art should be clearly described in the introduction or a separate section. Have similar tools been developed? If yes, what is the novelty of the presented approach and what gap does it fill?

The further structure of the article should be improved so that contents appear in suitable sections. For instance, Subsection 2.2 "Use Case: Assembly line" starts very abruptly and does not fit into the parent section called "Theoretical background and Methodology". The transition to Subsection 2.3 is also very sudden. The majority of the contents in the section called "Results" are not actually results, but details about the practical implementation. The title of the section should reflect its contents. Perhaps the following overall structure would be more suitable: Introduction, state of the art, theoretical background and methodology, implementation, exemplary usage on an assembly line use case, conclusion.

The scope of the presented application is unclear to me. Are the algorithms in Figure 4 just an example / a subset of the capabilities of the toolbox or the whole available functionality? The figure suggests the toolbox is only suitable for classification, but later on, Figure 9 seems to show forecasting results. The scope and limits of the presented approach should be clearly stated. Is it really an assistant for data analysis and machine learning in general? Or is it only suitable for classification on time series data? I would further expect the title of the submission to reflect the capabilities of the proposed approach, as having a more specific title will help the target audience find the submission.

Some assertions could benefit from more details and some choices could be better motivated. For instance, in Subsection 2.5, it is said that the ML toolbox of Dorst et al. and Schneider et al. was used, but this choice is not motivated in any way. Some reasoning would help the reader understand why this choice was made. In the conclusion, it is claimed that adding a back-end can unfold the full potential of the concept. Why is this? It would be better to spell it out, so the reader does not have to guess. Similarly for the phrase "the authors will further develop their concept". It would be interesting to know how the authors will further develop the concept.

It should be clarified whether the submission is intended to be a research article or a software descriptor. The contents point to the latter, but it says research article in the submitted file. If the submission is intended as a software descriptor, instructions on how to contribute to the project should be included in the repository. If it is intended as a research article, I would expect a thorough evaluation of results and much fewer implementation details.

Minor improvements:

  • The abstract should end with a short outlook or implication of the presented results.
  • The term high-quality database seems unnecessarily ambiguous. Does it describe the database itself (schema design etc.) or its contents?
  • The capitalization of (sub)section headings as well as the title of the submission is not consistent.
  • Line 5: There is a word missing. Maybe: "[...] committees exclusively concern themselves with [...]"

  • Line 32: An English version
  • Line 45: The meaning "high quality / excellent" of the word "solid" is typically only intended in colloquial contexts
  • Line 124: It is claimed that a 10-fold cross-validation automatically determines the best combination. This should be explained in more detail. I assume that the following is happening: A model is built for each of the resulting 15 combinations and the performance of each model on unseen data is evaluated by 10-fold cross-validation. The best performing model is then automatically selected. This should be written out, so that the reader does not have to assume what is happening.
  • Line 221: A sentence ends with "Figure 9", but this does not fit into the structure of the sentence. It is also generally unclear to me why the word figure is sometimes written out and sometimes abbreviated.
  • Caption of Figure 9: "connect" should be "connected"

Metadata
  • Published: 2023-05-05
  • Last Updated: 2023-09-26
  • License: Creative Commons Attribution 4.0
  • Subjects: Data Governance, Data Literacy
  • Keywords: Machine Learning, Data Analysis, Measurement and data planning