2.8 Challenges in Supervised Learning
Key challenges to address in supervised machine learning include the following:
Insufficient training data: Complex models with many parameters require large, labeled datasets, which are often costly and difficult to obtain. Strategies like data augmentation and semi-supervised learning can help mitigate this issue of data scarcity.
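As a minimal sketch of the data-augmentation idea (the function name and noise scale are illustrative, not a standard API): each labeled feature vector is copied with small random perturbations, enlarging the training set while keeping labels intact.

```python
import random

def augment(samples, n_copies=2, noise_scale=0.05, seed=0):
    """Return the original (features, label) pairs plus n_copies
    noisy copies of each, with Gaussian jitter on the features."""
    rng = random.Random(seed)
    augmented = []
    for features, label in samples:
        augmented.append((features, label))
        for _ in range(n_copies):
            noisy = [x + rng.gauss(0.0, noise_scale) for x in features]
            augmented.append((noisy, label))  # label is unchanged
    return augmented

data = [([1.0, 2.0], 0), ([3.0, 4.0], 1)]
bigger = augment(data)
print(len(bigger))  # 6: each of the 2 originals plus 2 noisy copies
```

This simple jittering is only appropriate when small feature perturbations should not change the label; for images or text, domain-specific augmentations (rotations, synonym substitution, etc.) are used instead.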
Dealing with large datasets: This is a complementary issue to the previous one. Processing large datasets requires extensive computational resources, and some are too large to fit into the memory of a single machine. Such datasets are often handled using specialized algorithms, such as distributed or online algorithms.
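The online approach can be sketched with stochastic gradient descent for simple linear regression (a toy setup for illustration): the model is updated one sample at a time from a stream, so the dataset never needs to fit in memory.

```python
def sgd_linear_regression(stream, lr=0.05):
    """Fit y ~ w*x + b from a stream of (x, y) pairs, one sample
    at a time, without ever holding the data in memory."""
    w, b = 0.0, 0.0
    for x, y in stream:
        error = (w * x + b) - y
        w -= lr * error * x   # per-sample gradient step on squared error
        b -= lr * error
    return w, b

# A generator simulates data too large to store: y = 2x + 1,
# with x cycling over {0.0, 0.1, ..., 0.9}.
stream = ((i % 10 / 10, 2 * (i % 10 / 10) + 1) for i in range(20_000))
w, b = sgd_linear_regression(stream)
print(round(w, 2), round(b, 2))  # close to the true values 2 and 1
```

Distributed algorithms tackle the same problem differently, by partitioning the data across machines and aggregating partial results.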
High dimensionality: Working with high-dimensional data poses significant challenges, often referred to as the curse of dimensionality (see Section 6.6). As the number of dimensions increases, data points become increasingly sparse and less informative, causing standard distance metrics to lose their effectiveness. These issues are commonly addressed through feature selection and dimensionality reduction techniques.
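The loss of contrast in distances can be observed with a small experiment (function name and parameters are illustrative): as the dimension grows, the nearest and farthest neighbors of a query point become almost equally far away.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """(max - min) / min of the distances from a random query point
    to n_points uniform random points in the unit hypercube."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query = [rng.random() for _ in range(dim)]
    dists = [math.dist(query, p) for p in points]
    return (max(dists) - min(dists)) / min(dists)

print(distance_contrast(2))    # large: near and far points are clearly distinct
print(distance_contrast(500))  # small: all points are almost equally far away
```

When this ratio approaches zero, nearest-neighbor-style methods lose their discriminative power, which is one concrete face of the curse of dimensionality.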
Overfitting and underfitting: Achieving the optimal balance between underfitting and overfitting is a key challenge in model development (see Section 2.6).
Heterogeneity of the data: Handling diverse data types (continuous, categorical, ordinal, counts, etc.) and formats (tabular, graph, text, image, etc.) presents significant challenges. Most machine learning algorithms can only work with numerical input, so data preprocessing is needed to convert non-numerical data types into a numerical format (see Section 3.10).
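As a sketch of one common preprocessing step, one-hot encoding (the function below is illustrative, not a library API), a categorical column is replaced by one binary column per category so that the data becomes purely numerical:

```python
def one_hot_encode(rows, column):
    """Replace a categorical column with one 0/1 column per category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = dict(row)
        value = new_row.pop(column)          # drop the categorical column
        for cat in categories:
            new_row[f"{column}={cat}"] = 1 if value == cat else 0
        encoded.append(new_row)
    return encoded

rows = [{"age": 30, "city": "Paris"}, {"age": 25, "city": "Rome"}]
print(one_hot_encode(rows, "city"))
```

Other data types need other conversions, e.g. integer codes for ordinal variables or embeddings for text and images.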
Data quality issues: Issues such as noisy data, missing values, and redundancy can severely degrade model performance and must be addressed through proper data cleaning and preprocessing.
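One standard cleaning step is mean imputation of missing values, sketched here for tabular numeric data (the function name is illustrative):

```python
from statistics import mean

def impute_mean(rows):
    """Fill None entries in each numeric column with that column's mean."""
    columns = rows[0].keys()
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in columns}
    return [{c: (means[c] if r[c] is None else r[c]) for c in columns}
            for r in rows]

data = [{"x": 1.0, "y": 2.0}, {"x": None, "y": 4.0}, {"x": 3.0, "y": None}]
print(impute_mean(data))  # missing "x" becomes 2.0, missing "y" becomes 3.0
```

Mean imputation is only one option; median imputation is more robust to outliers, and model-based imputation can exploit correlations between columns.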
Complex objective functions: Many learning tasks have complex, nonlinear objective functions with multiple local optima, making optimization difficult. Common methods such as gradient descent generally converge to locally optimal solutions, which may be far inferior to the global optimum.
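The problem can be seen on a simple one-dimensional example (the function f below is chosen purely for illustration): plain gradient descent on f(x) = x^4 - 3x^2 + x reaches different minima depending on where it starts.

```python
def gradient_descent(df, x0, lr=0.01, steps=1000):
    """Plain gradient descent on a 1-D function given its derivative df."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x

# f(x) = x**4 - 3*x**2 + x has two minima; which one gradient descent
# finds depends entirely on the starting point.
df = lambda x: 4 * x**3 - 6 * x + 1

print(gradient_descent(df, x0=-2.0))  # lands near x = -1.30, the global minimum
print(gradient_descent(df, x0=2.0))   # lands near x = 1.13, a worse local minimum
```

Random restarts, momentum, or global optimization methods are common ways to reduce the risk of getting stuck in a poor local optimum.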
No theoretical guarantees: Most machine learning algorithms do not provide theoretical guarantees on their performance (i.e., their generalization error). As a result, it is usually impossible to know in advance which algorithm will work best for a given problem, requiring empirical comparison across multiple methods.
Interpretability of the model: Interpretability refers to how well a human can understand the decisions made by the model. Some machine learning models, such as neural networks, behave like “black boxes,” where the decision-making process is not easily explainable. Interpretability is particularly important in critical domains such as finance or healthcare, where model decisions can have a profound impact.
Concept drift: In dynamic environments, models must continuously adapt to remain accurate, as data distributions may shift due to changing conditions. For example, spam filters adapt over time to detect new forms of unwanted emails as spammers continually change their strategies.
