
2.7 Building a Machine Learning Model

Having outlined the basic components and objectives of a machine learning model, we are now ready to present a typical workflow of building such a model (see Figure 2.10). The process generally involves the following steps:

Figure 2.10

Figure 2.10: The main stages in building a supervised machine learning model. The initial phase involves using a learning algorithm to develop a model from a set of labeled training examples. The trained model is then applied to a test set of new samples to make predictions, and these predictions are compared to their actual labels in order to evaluate the model’s ability to generalize to new, unseen data. In practice, each of these stages consists of multiple steps; refer to the main text for more details.

  1. Problem Definition:

    1. Define the problem you want to solve. Is it a classification, regression, clustering, or some other type of problem?

    2. Clearly specify the inputs and desired outputs. For example, in a handwritten zip code recognition task, the input might be a single digit, the entire zip code, or even an image of the envelope containing the zip code.

    3. Choose an appropriate performance metric to evaluate the model, such as accuracy, F1 score, mean squared error, etc.
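The metrics above can be computed directly with scikit-learn (assumed available here); a quick sketch on hand-made label vectors:

```python
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)  # fraction of correct predictions
f1 = f1_score(y_true, y_pred)         # harmonic mean of precision and recall

# For regression, mean squared error penalizes large deviations quadratically.
mse = mean_squared_error([2.0, 3.5], [2.5, 3.0])

print(acc, f1, mse)  # 0.8 0.8 0.25
```

Which metric is appropriate depends on the problem: accuracy can be misleading on imbalanced classes, where the F1 score is often more informative.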

  2. Data Collection:

    1. Acquire data relevant to the problem—e.g., from databases, web scraping, or third-party sources. The dataset needs to be representative of the real-world domain of the problem.

    2. For supervised learning tasks, you also need to collect labels for the data, either from human experts or derived from measurements.

  3. Data Cleaning and Preprocessing:

    1. Handle missing values by imputation or removal.

    2. Encode categorical data into numerical format, using techniques like one-hot encoding or ordinal encoding.

    3. Detect and remove outliers, or determine how to handle them.

    4. Normalize or standardize features when required by the algorithm.
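A minimal preprocessing sketch combining these steps with scikit-learn and pandas (the column names and values are hypothetical):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25.0, None, 40.0, 31.0],            # numeric, with a missing value
    "city": ["Oslo", "Rome", "Oslo", "Lima"],   # categorical
})

preprocess = ColumnTransformer([
    # Impute missing numbers with the median, then standardize.
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), ["age"]),
    # Encode the categorical column as one-hot vectors.
    ("cat", OneHotEncoder(), ["city"]),
])

X = preprocess.fit_transform(df)
print(X.shape)  # one scaled numeric column plus three one-hot columns
```

Wrapping the transformations in a single object like this also makes it easy to apply exactly the same preprocessing to the test set later.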

  4. Data Exploration:

    1. Compute descriptive statistics to summarize the distribution of each feature.

    2. Visualize the data to understand patterns, relationships, and anomalies.
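As a brief sketch of this step with pandas (on hypothetical data), descriptive statistics and pairwise correlations are one line each; in practice you would also plot histograms and scatter plots:

```python
import pandas as pd

df = pd.DataFrame({"height": [1.60, 1.75, 1.82, 1.68],
                   "weight": [55.0, 72.0, 80.0, 60.0]})

summary = df.describe()  # count, mean, std, min, quartiles, max per feature
corr = df.corr()         # pairwise Pearson correlations

print(summary.loc["mean"])
print(corr.loc["height", "weight"])  # strongly positive in this toy data
```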

  5. Feature Engineering:

    1. Create new features that could improve model performance.

    2. Reduce dimensionality if needed using techniques like PCA or t-SNE.
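A minimal PCA sketch with scikit-learn, on synthetic data where one feature is nearly redundant, so two components retain almost all the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + 0.01 * rng.normal(size=100)  # third feature ~ copy of first

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)  # project onto 2 highest-variance directions

print(X_reduced.shape)                       # (100, 2)
print(pca.explained_variance_ratio_.sum())   # close to 1: little information lost
```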

  6. Train-Test Split:

    1. Divide your data into training and test sets. Typically, 70%-80% of the data is used for training and the rest for testing.

    2. Consider using a validation set or cross-validation for model selection and hyperparameter tuning.
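A sketch of such a split with scikit-learn; the 80/20 ratio and the fixed random seed are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)  # preserve class ratios

print(len(X_train), len(X_test))  # 120 30
```

For classification problems, `stratify=y` keeps the class proportions the same in both splits, which matters especially for imbalanced data.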

  7. Model Selection:

    1. Choose a suitable algorithm for the task. Different algorithms are suited for different types of data and problems:

      • For structured (tabular) data, gradient boosting methods such as XGBoost and LightGBM are often the go-to choice.

      • For text data, models such as naive Bayes and, more recently, transformer-based architectures are commonly used.

      • For image data, convolutional neural networks (CNNs) and vision transformers usually give the best results.

    2. Start with a simple model as a baseline to quickly assess the problem’s difficulty and identify any issues in the data:

      • For regression tasks, you might start with linear regression (Chapter 4).

      • For classification tasks, logistic regression (Chapter 5) or a simple decision tree (Chapter 8) can be a good starting point.

    3. If the baseline underfits (performs poorly on the training set), consider switching to more complex algorithms. These algorithms typically require more data and computation.

    4. Be cautious of overfitting as model complexity increases. Techniques such as regularization or early stopping can help mitigate this.

    5. Model selection is typically iterative: train a model, evaluate it on the validation set (or via cross-validation), and revise based on the results or new insights about the data.
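The baseline-first approach might be sketched as follows: fit a logistic regression (here on a built-in scikit-learn dataset) and compare training and validation accuracy to judge whether the model under- or overfits:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Standardize features, then fit a simple linear baseline.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
baseline.fit(X_train, y_train)

print(baseline.score(X_train, y_train))  # training accuracy
print(baseline.score(X_val, y_val))      # validation accuracy
```

Low accuracy on both sets suggests underfitting (try a more expressive model); high training accuracy with low validation accuracy suggests overfitting.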

  8. Model Training:

    1. Train the selected model on the training set.

  9. Model Evaluation and Hyperparameter Tuning:

    1. Evaluate the trained model on the validation set (if available) or using cross-validation.

    2. Tune the hyperparameters using techniques such as grid search, random search, or Bayesian optimization (see Section 3.8).

    3. Check for overfitting: if the model performs well on the training set but poorly on the validation set, consider applying regularization, choosing a simpler model, or adjusting the hyperparameters.

    4. This step is typically iterative: tune the hyperparameters, train a new model, evaluate it on the validation set (or via cross-validation), and revise based on the results.
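A grid-search sketch with scikit-learn: cross-validated tuning of a decision tree's depth (the parameter grid and dataset are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [1, 2, 3, 5, None]},
    cv=5,                 # 5-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)          # trains one model per fold and parameter setting

print(search.best_params_)  # depth with the best cross-validated accuracy
print(search.best_score_)
```

Random search and Bayesian optimization follow the same pattern but sample the hyperparameter space instead of enumerating it exhaustively.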

  10. Final Model Training and Evaluation:

    1. Once the best model and hyperparameters are selected, retrain the model on the full training data (including the validation set).

    2. Evaluate the final model on the test set to obtain an unbiased estimate of its performance. Report the result as is, and avoid further tuning of the model.
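This final step might look as follows in scikit-learn (the chosen hyperparameter values are hypothetical stand-ins for the result of tuning):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Hyperparameters selected earlier; retrain on all non-test data, no re-tuning.
final_model = LogisticRegression(C=1.0, max_iter=1000)
final_model.fit(X_trainval, y_trainval)

test_accuracy = final_model.score(X_test, y_test)
print(test_accuracy)  # reported once, as is
```

The key discipline here is to touch the test set exactly once; any further tuning against it would bias the performance estimate.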

  11. Model Deployment:

    1. Deploy the model to a production environment or make it accessible via APIs.

    2. Continuously monitor the model’s performance in production. Due to concept drift (changes in the statistical properties of the data over time), the model may become outdated, requiring retraining or other adjustments.
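A toy sketch of one simple drift check: compare a feature's mean in recent production data to its training-time mean, and flag large shifts. The synthetic data and the threshold are illustrative; real monitoring typically uses statistical tests over many features:

```python
import numpy as np

rng = np.random.default_rng(1)
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)  # seen at training time
live_feature = rng.normal(loc=0.8, scale=1.0, size=1000)   # drifted production data

# Flag drift if the mean moved by more than 0.5 training standard deviations.
shift = abs(live_feature.mean() - train_feature.mean()) / train_feature.std()
drift_detected = shift > 0.5

print(shift, drift_detected)
```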

Implementation of the entire workflow in Python is demonstrated in Section 3.4.
