Comments
Comment #183 Jonas Maximilian Werheid @ 2024-10-30 21:34
Dear Reviewer #132,
thank you very much for your feedback.
The following answers correspond, in order, to the points in your feedback:
• In the revised manuscript, we have standardized the terminology to: "small and medium-sized enterprises".
• The subject under investigation is now specified in the revised manuscript as: "However, the literature does not identify object detection datasets as best practices in the context of exemplary manufacturing applications and adherence to FAIR principles.".
• The values were meant to be normalized image pixels. Therefore, the text is changed in the revised manuscript to: "Width and height represent the dimensions of the bounding box in normalized pixels, where pixel values are scaled between 0 and 1, relative to the image dimension.".
• We mean the center of the bounding box. Therefore, the revised manuscript specifies this: "[…]height and width of each bounding box center[…]". A brief sketch of the normalized, center-based annotation format is given after this list.
• Exactly: the shapes of all objects are constant, as is the camera's distance to the ground. The distance from the camera to each object is then determined by the object's angle relative to the camera. Therefore, in the revised manuscript it is corrected to: "Moreover, the setup includes fluorescent lighting directed toward the objects under inspection, with the camera positioned on a tripod to maintain a fixed distance from the ground, where all the objects are placed. The distance between the camera and the objects is determined by the angle each object has relative to the camera.".
• The description of P and R is modified in the revised manuscript to: "To clarify this, \textit{P} is introduced to only indicate relevant ones. It measures the proportion of correctly recognized objects out of all detected objects. \textit{R}, on the other hand, measures the proportion of relevant objects that were correctly recognized by the model out of all relevant objects.", and AP𝑖 is now explicitly introduced: "The area under this curve provides the average precision for each class (AP𝑖) for the trained model.". The corresponding formulas are summarized after this list.
• When using YOLOv5, it is always necessary to allocate images to a validation set, as the training process will not begin unless the path to the validation set is found in the dataset configuration. This validation set acts as a regularization measure to help prevent overfitting, in our use case as in general. A minimal configuration sketch is given after this list.
• We agree that it would be interesting to include the initial training accuracies. While the structure of the validation process was adjusted in response to the other reviewer’s feedback, we have also added information on the training accuracy of each model in the Appendix. However, we emphasize that the validation and test accuracies are more important for our approach, as they reflect the model's performance on unseen data during and after training. This is why these metrics are described in more detail.
• The links in Section 5 are updated to work properly and match the visible text.
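For the two bounding-box points above, the following minimal sketch illustrates the normalized, center-based annotation format; the function, variable names, and example values are purely illustrative and are not taken from the dataset:

    # Convert an absolute pixel bounding box to the normalized, center-based
    # representation described above (all values scaled to [0, 1] relative to
    # the image dimensions). Illustrative sketch only.
    def to_normalized_bbox(x_min, y_min, x_max, y_max, img_width, img_height):
        x_center = (x_min + x_max) / 2.0 / img_width
        y_center = (y_min + y_max) / 2.0 / img_height
        width = (x_max - x_min) / img_width
        height = (y_max - y_min) / img_height
        return x_center, y_center, width, height

    # Example: a 100 x 50 px box with its top-left corner at (200, 300)
    # inside a 640 x 640 image.
    print(to_normalized_bbox(200, 300, 300, 350, 640, 640))
    # -> (0.390625, 0.5078125, 0.15625, 0.078125)

In the YOLO annotation format, each label line then stores the class index followed by these four normalized values.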
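For reference, the standard definitions behind the precision, recall, and mAP descriptions above can be written as follows, where TP, FP, and FN denote true positives, false positives, and false negatives, and N is the number of classes:

    P = \frac{TP}{TP + FP}, \qquad
    R = \frac{TP}{TP + FN}, \qquad
    mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i,

with AP_i the area under the precision-recall curve of class i.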
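Regarding the validation-set requirement of YOLOv5, the sketch below shows a minimal dataset configuration written from Python; the paths, class count, and class names are hypothetical placeholders rather than the actual layout of our dataset:

    # Minimal sketch of a YOLOv5 dataset configuration. Training only starts
    # once the "val" path can be resolved by the framework.
    import yaml  # PyYAML

    data_cfg = {
        "train": "images/train",   # training images (placeholder path)
        "val": "images/valid",     # validation images (required)
        "test": "images/test",     # optional held-out test images
        "nc": 2,                   # number of classes (placeholder)
        "names": ["non_defective", "defective"],  # placeholder class names
    }

    with open("data.yaml", "w") as f:
        yaml.safe_dump(data_cfg, f)

    # Training could then be launched with, for example:
    #   python train.py --img 640 --data data.yaml --weights yolov5s.pt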
Comment #182 Jonas Maximilian Werheid @ 2024-10-30 21:33
Dear Reviewer #134,
thank you very much for your feedback.
The following answers correspond, in order, to the points in your feedback:
• The defects are manually produced in a controlled environment, but each defect varies in the way it damages the surface. The intention is not to represent all types of surface damage occurring in manufacturing, but rather to showcase surface anomalies as typical examples of damage in manufacturing. We agree that this can to some extent be seen as a limitation and have updated the conclusion: "However, it’s important to note that the limitation lies in the inability to directly apply such models or data to unrelated tasks. Additionally, it is important to recognize that industrial damages can significantly differ in the complexity of their defects."
• There are two key points to consider for the publication overall. First, the dataset is intended to serve as an exemplary educational resource, particularly designed for low-barrier initiatives, such as those in SMEs. It enables the development of computer vision algorithms, like YOLO object detection, with minimal computational resources due to the limited number of image instances and their low resolution, resulting in shorter training times. It is not intended for direct application to specific use cases, which is noted as a limitation in the conclusion. Second, the dataset descriptor is also designed as a showcase of a FAIR data publication for engineering data, with a quantitative evaluation of its FAIRness using the FUJI tool, aligning with the journal's aim, which focuses on "FAIR data management in engineering sciences" (https://inggrid.org). We have also included this aspect in the abstract of the revised manuscript: "Furthermore, real-world datasets often lack adherence to FAIR principles, which limits their accessibility and interoperability[…]"; "While our focus is on demonstrating object detection with low-resolution images and limited data availability, the generated data and trained model also adhere to FAIR principles. Therefore, these resources are made available with proper metadata to support their reuse and further investigation[…]".
Lastly, we would like to offer some resources that address data availability challenges in SMEs [https://doi.org/10.3390/app11146378, https://doi.org/10.1016/j.eswa.2023.119623], as well as references on handling low resolution in computer vision [https://doi.org/10.1016/j.engappai.2023.107206], which can serve as examples for these discussions. The original camera resolution of 1148x862 was cropped to 640x640 to align with the recommendation for the YOLOv5 algorithm (https://docs.ultralytics.com/yolov5/tutorials/train_custom_data/#13-prepare-dataset-for-yolov5). In the revised manuscript, we included the exact camera resolution prior to processing with Roboflow: "Roboflow is used for auto-orienting to eliminate common rotations based on metadata, standardizing pixel ordering, and resizing the images to a frame of 640x640 pixels from the original resolution of 1640x1232." In this case, a higher resolution was not necessary and could negatively impact the inference time of the application.
• We agree that considering imbalance as a typical challenge in manufacturing datasets is an interesting point for analysis. First, the published dataset can be utilized to create such an imbalance, as the metadata indicates the number of defective and non-defective instances in each image; a sketch of this is given after this list. Additionally, in the revised manuscript, we trained models on imbalanced versions of the 1st and 2nd datasets, describing the class imbalance distribution and presenting their results. All sections of the revised manuscript have been updated to explain the motivation for addressing class imbalance and to specify how it was created and analyzed.
• There is a possibility that the model may have overfitted due to the repetition of the same defect instances across multiple images. To address this, we created new defects and generated up to 150 additional images, replacing the existing 150 in the testing dataset. These new images included only previously unseen surface defects that the models had not encountered during training. This adjustment was incorporated into our article and led to lower accuracy rates across all metrics on the testing set. The modified manuscript discusses the reduced performance resulting from the more challenging test data, along with the imbalance.
• The newly created 150 test instances increase the difficulty of this benchmark. Additionally, we adjusted the first two datasets, which initially used a relatively small amount of training data, by allocating a larger portion to validation during training. The new train/valid/test distribution is set to 0.64/0.16/0.2 (0.8*0.8 / 0.8*0.2 / 0.2), as sketched directly below. Although the complexity has increased with the new test dataset, which has also been uploaded on Zenodo as part of this revision, we would like to once again emphasize the broader scope of this research as a FAIR data publication. It serves as an example within the context of research data management, supporting low-barrier initiatives for education and computer vision in SMEs.
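The nested 80/20 split behind the 0.64/0.16/0.2 ratio can be illustrated as follows; this is only a sketch of the arithmetic (the directory path and random seed are placeholders), not the exact assignment procedure used for the revised datasets:

    # Two-stage split: first hold out 20% of the images as the test set,
    # then split the remaining 80% again into 80% training and 20% validation,
    # giving 0.8*0.8 = 0.64 train, 0.8*0.2 = 0.16 valid, 0.2 test.
    from glob import glob
    from sklearn.model_selection import train_test_split

    image_paths = sorted(glob("dataset/images/*.jpg"))  # placeholder path

    train_val, test = train_test_split(image_paths, test_size=0.20, random_state=0)
    train, valid = train_test_split(train_val, test_size=0.20, random_state=0)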
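As noted in the response on class imbalance above, the per-image metadata can be used to derive an imbalanced subset from the published dataset. The following is a rough sketch assuming a hypothetical metadata field for the number of defective instances per image; the field name and fraction are illustrative only:

    # Build a class-imbalanced subset: keep every image without defects but
    # only a small random fraction of the images that contain defects.
    import random

    def make_imbalanced(records, defect_fraction=0.1, seed=0):
        rng = random.Random(seed)
        clean = [r for r in records if r["defective_instances"] == 0]
        defect = [r for r in records if r["defective_instances"] > 0]
        keep = rng.sample(defect, int(defect_fraction * len(defect)))
        return clean + keep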
Comment #172 Thomas Bauernhansl @ 2024-10-02 11:43
As the responsible topical editor, I would like to thank the reviewers for their constructive feedback.
After consideration of the comments, I advise the authors to revise the publication considering the reviews and their suggestions. Please submit a revised version of the publication within 4 weeks for further consideration.
Invited Review Comment #134 Anonymous @ 2024-09-10 22:32
This paper, titled "Simplified Object Detection for Manufacturing: Introducing a Low-Resolution Dataset," presents a low-resolution dataset for object detection in manufacturing, specifically focused on identifying defects in plastic bricks.
The main contributions include the introduction of a simplified dataset, the evaluation of the dataset using the YOLOv5 model, and an analysis of the impact of varying data availability on detection performance.
The dataset and experimental approach presented in this paper have several notable limitations that diminish the practical relevance and value of the study.
1. The dataset does not accurately represent a realistic production environment, and it lacks diversity. It consists solely of plastic bricks, but the variability between these bricks and their defects is not clearly defined. The dataset only features similar, manually created defects under controlled conditions, failing to capture the complexity and variability typically seen in real-world manufacturing scenarios. The experimental results further suggest that the dataset is overly simplistic, raising questions about its practical utility and the potential insights it offers.
2. The authors state their motivation as filling a gap in the literature concerning low-resolution datasets with limited data availability for object detection in manufacturing. However, this raises concerns about the actual benefit for SMEs. Computer vision systems for quality inspection are usually tailored for specific products or production lines, often using high-resolution cameras, which are widely accessible and not a significant barrier. The choice of using plastic bricks in a low-resolution setting implies a problem that may not align with real manufacturing challenges. The authors should provide concrete evidence to support their claim that low-resolution and limited data are common issues in such contexts; otherwise, the dataset appears to be more of a theoretical exercise rather than a practical contribution.
3. In practical manufacturing scenarios, the primary challenge lies in collecting sufficient data on defective items, as the data are typically strongly biased towards non-defective samples. This imbalance is the real data availability problem. However, this is not adequately taken into account in the presented experiment.
4. The reported good results in the experiments might be due to overfitting, especially if the training and test images are too similar. The authors should provide more details on how the training and test datasets were selected, along with examples of the images used, to ensure that the model's performance is not artificially inflated by data redundancy or lack of variation.
5. The experimental results demonstrate that the task and dataset are too easy, as the model achieves high performance even with a very small training set of just 35 samples. This raises significant doubts about the value of the dataset—both as a benchmark and as a resource for transfer learning. The simplicity and narrow scope of the dataset make it unlikely that the trained model could serve as a robust base model for other, more complex tasks. The high accuracy could simply reflect the model's overfitting to this trivial problem rather than its potential for broader applicability.
Invited Review Comment #132 Anonymous @ 2024-09-09 18:38
The paper presents and analyzes a low-resolution dataset. The authors use a state-of-the-art real-time object detection algorithm (YOLOv5) to detect and classify objects within the dataset.
While the novelty of the paper seems limited, the open-access dataset appears useful for developing and comparing object detection and classification algorithms.
Notes to the authors:
- Please use uniform terminology (e.g. small- and middle-sized enterprises vs. small- and medium-sized enterprises)
- In section 1 (“However, the literature found does not describe the specific subject area under investigation.”): Can you specify what the specific subject area under investigation is?
- In section 2.1 (“Width and height represent the dimensions of the bounding box in pixels.”): The values in Table 1 do not seem to be in pixels.
- In section 2.1 (“height and width of each center”): Do you mean “of each bounding box”?
- In section 2.2, in line 100 it is stated that the distance between the camera and object is constant, however in Figure 1b it is stated that the size indicates the dimension of an object and the distance to the camera. Shouldn’t the different sizes of the bounding box instead result from the dimension and the orientation of the object?
- In section 3.1, the textual explanations of P and R are not very precisely formulated, especially in the case of P: “divided by the total number of objects” -> should be something like “total number of detections”. Please revise this paragraph. Also regarding the formula for mean average precision, please mention what 𝐴𝑃𝑖 is.
- What is the reason for the separate validation set and testing set? Do you do hyperparameter optimization using the validation set? This is not mentioned in the paper. Or do you just want to test with different test set sizes?
- Minor note: Since the validation set and testing set are the same for 1st through 4th, why not just use the training set size in the tables and graphs instead of 1st through 4th?
- In section 5, the link does not work (and does not match the displayed text)