TY - JOUR
T1 - Reducing annotation effort in agricultural data
T2 - simple and fast unsupervised coreset selection with DINOv2 and K-means
AU - Gómez-Zamanillo, Laura
AU - Portilla, Nagore
AU - Picón, Artzai
AU - Egusquiza, Itziar
AU - Navarra-Mestre, Ramón
AU - Elola, Andoni
AU - Bereciartua-Perez, Arantza
N1 - Publisher Copyright:
Copyright © 2025 Gómez-Zamanillo, Portilla, Picón, Egusquiza, Navarra-Mestre, Elola and Bereciartua-Perez.
PY - 2025
Y1 - 2025
N2 - The need for large amounts of annotated data is a major obstacle to adopting deep learning in agricultural applications, where annotation is typically time-consuming and requires expert knowledge. To address this issue, methods have been developed to select data for manual annotation that represent the existing variability in the dataset, thereby avoiding redundant information. Coreset selection methods aim to choose a small subset of data samples that best represents the entire dataset. They can therefore be used to select a reduced set of samples for annotation, optimizing the training of a deep learning model for the best possible performance. In this work, we propose a simple yet effective coreset selection method that combines the recent foundation model DINOv2, used as a powerful feature extractor, with the well-known K-Means clustering algorithm. Samples are then selected from each resulting cluster to form the final coreset. The proposed method is validated by comparing the performance of a multiclass classification model trained on datasets reduced either randomly or with the proposed method. This validation is conducted on two different datasets, and in both cases the proposed method achieves better results, with improvements of up to 0.15 in F1 score for significant reductions of the training datasets. Additionally, the importance of using DINOv2 as the feature extractor for achieving these results is studied.
AB - The need for large amounts of annotated data is a major obstacle to adopting deep learning in agricultural applications, where annotation is typically time-consuming and requires expert knowledge. To address this issue, methods have been developed to select data for manual annotation that represent the existing variability in the dataset, thereby avoiding redundant information. Coreset selection methods aim to choose a small subset of data samples that best represents the entire dataset. They can therefore be used to select a reduced set of samples for annotation, optimizing the training of a deep learning model for the best possible performance. In this work, we propose a simple yet effective coreset selection method that combines the recent foundation model DINOv2, used as a powerful feature extractor, with the well-known K-Means clustering algorithm. Samples are then selected from each resulting cluster to form the final coreset. The proposed method is validated by comparing the performance of a multiclass classification model trained on datasets reduced either randomly or with the proposed method. This validation is conducted on two different datasets, and in both cases the proposed method achieves better results, with improvements of up to 0.15 in F1 score for significant reductions of the training datasets. Additionally, the importance of using DINOv2 as the feature extractor for achieving these results is studied.
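N1 - The selection pipeline described in the abstract (DINOv2 features, K-Means clustering, per-cluster sampling) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `coreset_select` is hypothetical, random vectors stand in for DINOv2 embeddings (e.g. 768-d image features), and the per-cluster rule shown (nearest sample to each centroid) is one plausible choice that the paper's actual rule may differ from.

```python
import numpy as np
from sklearn.cluster import KMeans

def coreset_select(features, budget, seed=0):
    """Pick a coreset of at most `budget` samples: cluster the
    feature vectors with K-Means and keep, for each cluster, the
    sample nearest its centroid."""
    km = KMeans(n_clusters=budget, n_init=10, random_state=seed)
    km.fit(features)
    chosen = []
    for center in km.cluster_centers_:
        # index of the sample closest to this cluster center
        idx = int(np.argmin(np.linalg.norm(features - center, axis=1)))
        chosen.append(idx)
    return sorted(set(chosen))

# Stand-in for DINOv2 embeddings of 200 unlabeled images.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 768))
subset = coreset_select(feats, budget=20)  # indices to send for annotation
```

Only the `subset` indices would then be manually annotated and used for training, which is how the method trades annotation effort for coverage of the dataset's variability.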
KW - agriculture
KW - coreset selection
KW - foundation models
KW - label-efficient learning
KW - unsupervised clustering
UR - https://www.scopus.com/pages/publications/105006673510
U2 - 10.3389/fpls.2025.1546756
DO - 10.3389/fpls.2025.1546756
M3 - Article
AN - SCOPUS:105006673510
SN - 1664-462X
VL - 16
JO - Frontiers in Plant Science
JF - Frontiers in Plant Science
M1 - 1546756
ER -