Abstract:
Developing reliable machine learning (ML) models depends on the availability of large, diverse, and balanced datasets. In practice, however, limited and imbalanced data are common challenges in both scientific research and business applications. This paper analyses the financial, regulatory, ethical, and technical constraints that contribute to dataset limitations, with a particular focus on their impact on model robustness and generalizability. High acquisition costs, intellectual property restrictions, and inadequate labelling practices limit data availability, while regulatory frameworks impose strict constraints on data usage and cross-border transfer. Technical challenges, including insufficient computational resources, label noise, and integration difficulties, further exacerbate the problem of small datasets. In business applications in fields such as finance, healthcare, and manufacturing, these constraints not only reduce predictive accuracy but also impair decision-making efficiency. Understanding these factors is essential for developing strategies that mitigate dataset limitations, ensure that the datasets prepared are correct and sufficient, and support efficient ML.
