
This chapter is from the book

Statistics and Machine Learning

There are two classes of techniques for predictive analytics with very different legacies: statistical methods and machine learning.

Statistical methods, such as linear regression, estimate the parameters of mathematical models with known properties. The analyst tests the hypothesis that the behavior of interest conforms to a specific class of mathematical model. The advantage of these models is that they generalize well: if you can demonstrate that historical data conform to a known distribution, you can use that distribution to predict behavior for new cases.
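To make the idea of estimating parameters concrete, here is a minimal sketch of ordinary least squares for a straight line with one predictor. The function names (`fit_line`, `predict`) and the purchase figures are made up for illustration; they are assumptions, not data from any real campaign.

```python
from statistics import mean

def fit_line(xs, ys):
    """Estimate intercept and slope of y = intercept + slope * x by least squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

def predict(intercept, slope, x):
    """Apply the fitted line to a new case."""
    return intercept + slope * x

# Hypothetical history: past purchases per customer vs. amount spent on an offer.
past_purchases = [1, 2, 3, 4, 5, 6]
amount_spent = [12.0, 15.5, 17.0, 21.5, 24.0, 26.0]

intercept, slope = fit_line(past_purchases, amount_spent)
new_case = predict(intercept, slope, 8)
print(f"intercept={intercept:.2f} slope={slope:.2f} predicted spend={new_case:.2f}")
```

The model form (a straight line) is fixed in advance; the data only determine the two parameters. That is what lets the fitted line be applied to a customer with eight purchases even though no such customer appears in the history.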

For example, if you know the position, velocity, and acceleration of an artillery shell, you can predict where it will land because you can use a mathematical model to compute the point of impact. By analogy, if you can show that response to a marketing campaign follows a known statistical distribution, you can predict response with a degree of confidence based on information about the customer’s past purchases, demographics, characteristics of the offer, and so forth.
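The artillery calculation can be sketched in a few lines. This version assumes constant downward acceleration, no air resistance, and flat ground at height zero; the function name `point_of_impact` and the default value of `g` are choices made here for illustration.

```python
import math

def point_of_impact(x0, y0, vx, vy, g=9.81):
    """Horizontal landing position of a shell launched from (x0, y0).

    Assumes constant downward acceleration g, no drag, and ground at height 0.
    """
    if y0 < 0:
        raise ValueError("launch height must be at or above ground level")
    # Height over time is y(t) = y0 + vy*t - 0.5*g*t**2.
    # Setting y(t) = 0 and taking the non-negative root gives the flight time.
    flight_time = (vy + math.sqrt(vy ** 2 + 2.0 * g * y0)) / g
    return x0 + vx * flight_time

# Example: launched from the origin at 10 m/s horizontally and 9.81 m/s upward.
print(point_of_impact(0.0, 0.0, 10.0, 9.81))
```

Because the model is known, three measurements are enough to predict an outcome that has not been observed yet. A statistical model plays the same role for customer response once the data have been shown to fit it.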

The principal disadvantage of statistical methods is that real-world phenomena frequently do not conform to known statistical distributions.

Machine learning techniques differ fundamentally from statistical techniques because they do not start from a particular hypothesis about behavior; instead, they seek to learn and describe the relationship between historical facts and target behavior as closely as possible. Because machine learning techniques are not constrained by specific statistical distributions, they can often build more accurate models than statistical methods when the data do not follow a known distribution.
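As a rough illustration of the difference, the sketch below fits a straight line and a nearest-neighbor predictor to the same curved data. Nearest-neighbor averaging is used here only as one simple example of a technique that assumes no model form; the data, the choice of `k=2`, and the function names are all assumptions made for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def knn_predict(xs, ys, x_new, k=2):
    """Average the targets of the k historical cases closest to x_new."""
    by_distance = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x_new))
    return mean([y for _, y in by_distance[:k]])

# Made-up history in which the target is a curved (quadratic) function of x.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

intercept, slope = fit_line(xs, ys)
x_new = 2.5
actual = x_new ** 2

line_error = abs(intercept + slope * x_new - actual)
knn_error = abs(knn_predict(xs, ys, x_new) - actual)
print(f"line error={line_error}, nearest-neighbor error={knn_error}")
```

The straight line is the wrong hypothesis for this data, so its prediction is poor. The nearest-neighbor predictor makes no such hypothesis and simply follows the cases it has seen.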

However, machine learning techniques can overlearn (also called overfitting), which means they learn relationships in the training data that do not generalize to the population. Consequently, most widely used machine learning techniques have built-in mechanisms to control overlearning, such as cross-validation or pruning on an independent sample.
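The sketch below shows overlearning and one simple control for it. It uses a single independent holdout sample, a simpler relative of cross-validation, to choose the number of neighbors in a nearest-neighbor predictor. The data are made up, a fixed repeating perturbation stands in for noise so the run is reproducible, and all names are assumptions for this example.

```python
from statistics import mean

def knn_predict(train, x_new, k):
    """Average the targets of the k training cases closest to x_new."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x_new))[:k]
    return mean([y for _, y in nearest])

def mse(train, sample, k):
    """Mean squared error of the k-neighbor predictor on a sample."""
    return mean([(knn_predict(train, x, k) - y) ** 2 for x, y in sample])

# Made-up data: a linear trend plus a repeating perturbation standing in for noise.
perturbation = [1.0, 0.0, -1.0]
data = [(x, 2.0 * x + perturbation[x % 3]) for x in range(12)]

train = data[0::2]    # cases used to build the model
holdout = data[1::2]  # independent cases used only to check it

candidates = (1, 2, 3)
train_error = {k: mse(train, train, k) for k in candidates}
holdout_error = {k: mse(train, holdout, k) for k in candidates}

# Choose the setting that does best on cases the model has not seen.
best_k = min(holdout_error, key=holdout_error.get)
print("training error:", train_error)
print("holdout error:", holdout_error)
print("chosen k:", best_k)
```

With `k=1` the predictor reproduces every training case exactly, so its training error is zero, yet it does worse on the holdout sample than `k=2`. Judging the model on training data alone would reward the overlearned version, which is why the check uses an independent sample.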

The distinction between statistics and machine learning is narrowing as the two fields converge. For example, stepwise regression is a hybrid method that draws on both traditions.
