Analytic vs. Enumerative, one more time
In June, Brad Efron, statistics professor at Stanford, published an overview of the state of statistical theory and practice (Bradley Efron (2020) Prediction, Estimation, and Attribution, Journal of the American Statistical Association, 115:530, 636-655, DOI: 10.1080/01621459.2020.1762613).
Efron contrasts the relatively new methods for prediction, like random forests, with traditional tools like regression. What factors should you use to predict an outcome? For example, what biological and social factors will predict whether a patient will survive a disease like COVID-19? Old-school regression requires the analyst to identify factors and then build an equation that links the factors to the outcome. To make a prediction, insert the values of the factors into the equation.
The random forest, invented by Leo Breiman, uses the power of contemporary computing to build a prediction with modest mathematical machinery. It creates many classification trees and makes the prediction by ‘majority vote’ of the classification trees. The analyst turns on the algorithm and takes the computer at its word, with a verification step: the algorithm is trained on one portion of the original data and then tested on a second portion to estimate the error rate of the procedure. Usually, the portions are defined by random selection.
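To make the train-then-test idea concrete, here is a minimal sketch in Python. It is not Breiman's full algorithm: it uses bootstrap-aggregated one-split trees ('stumps') with majority vote and leaves out the random feature selection of a real random forest. The data, the 0.6 cut-off, the 10% label noise, and all function names are my own assumptions for illustration.

```python
import random
from collections import Counter

rng = random.Random(42)

def make_point():
    # Hypothetical data: two features; only the first one drives the label.
    x = (rng.uniform(0, 1), rng.uniform(0, 1))
    label = 1 if x[0] > 0.6 else 0
    if rng.random() < 0.1:  # assumed 10% label noise
        label = 1 - label
    return x, label

data = [make_point() for _ in range(600)]
rng.shuffle(data)                      # portions defined by random selection
train, test = data[:400], data[400:]

def fit_stump(sample):
    """Pick the feature, threshold, and polarity with the fewest errors on sample."""
    best = None
    for feature in (0, 1):
        for threshold in [i / 20 for i in range(1, 20)]:
            for above_label in (0, 1):
                errors = sum(
                    (above_label if x[feature] > threshold else 1 - above_label) != y
                    for x, y in sample
                )
                if best is None or errors < best[0]:
                    best = (errors, feature, threshold, above_label)
    return best[1:]

def stump_predict(stump, x):
    feature, threshold, above_label = stump
    return above_label if x[feature] > threshold else 1 - above_label

def fit_ensemble(train_set, n_trees=25):
    """Fit each stump on a bootstrap resample of the training portion."""
    stumps = []
    for _ in range(n_trees):
        bootstrap = [rng.choice(train_set) for _ in train_set]
        stumps.append(fit_stump(bootstrap))
    return stumps

def ensemble_predict(stumps, x):
    """Majority vote across the stumps (n_trees is odd, so no ties)."""
    votes = Counter(stump_predict(s, x) for s in stumps)
    return votes.most_common(1)[0][0]

ensemble = fit_ensemble(train)
test_error = sum(ensemble_predict(ensemble, x) != y for x, y in test) / len(test)
print(f"estimated error rate on the held-out portion: {test_error:.3f}")
```

The held-out error rate is only an estimate of future performance if future cases come from the same process that produced the 600 points.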
One of the major problems with computer-based prediction algorithms is ‘drift’. After offering two examples of drift, Efron concludes: “prediction is easier for interpolation than extrapolation” (p. 646).
And: “…drift can be interpreted as a change in the data-generating mechanism” (p. 646).
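If the data-generating mechanism changes over time and you create training and testing sets in a way that ignores this structure, for example by random selection, you may be badly fooled: the error rate on the test set can understate the error you will see on future cases.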
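A small simulation can illustrate the point. This is a sketch under my own assumptions, not one of Efron's examples: a hypothetical process whose slope drifts steadily upward with the time index, a straight-line fit by least squares, and two ways of holding out 20% of the data. The random split mixes early and late cases; the chronological split trains on the past and tests on the most recent cases.

```python
import random

rng = random.Random(0)
n = 1000

# Hypothetical drifting process: the slope linking x to y grows from 1 to 3 over time.
data = []
for t in range(n):
    x = rng.uniform(0, 1)
    slope_t = 1.0 + 2.0 * t / n
    y = slope_t * x + rng.gauss(0, 0.1)
    data.append((x, y))

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def holdout_error(train, test):
    """Mean squared prediction error on the test portion."""
    slope, intercept = fit_line(train)
    return sum((y - (slope * x + intercept)) ** 2 for x, y in test) / len(test)

# Split 1: random selection, ignoring the time order.
shuffled = data[:]
rng.shuffle(shuffled)
random_split_error = holdout_error(shuffled[:800], shuffled[800:])

# Split 2: chronological, train on the first 80% and test on the last 20%.
time_split_error = holdout_error(data[:800], data[800:])

print(f"random split error:        {random_split_error:.3f}")
print(f"chronological split error: {time_split_error:.3f}")
```

Under these assumptions the chronological split reports the larger error, because it asks the fitted line to extrapolate into a period it has not seen. The random split only asks it to interpolate.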
Efron’s discussion aligns with the enumerative versus analytic distinction W.E. Deming developed more than 70 years ago.
I’ve discussed analytic and enumerative problems in several previous posts.
If your problem has a data-generating system that does not change in space or time, then prediction algorithms can be safe to use. We can apply either old-school equations or modern prediction tools.
Most problems in my work involve systems that are not fixed. These systems are not stable in Shewhart’s sense. Statistical estimates of performance, whether mathematically expressed as probabilities for old-school methods or error rates for modern methods, require stability.
Extrapolation is a hard problem, as Efron implies.
Final words from Deming:
“There is no statistical method by which to extrapolate to longer usage of a drug beyond the period of test, nor to other patients, soils, climates, higher voltages, nor to other limits of severity outside the range studied. Side-effects may develop later on. Problems of maintenance of machinery that show up well in a test that covers three weeks may cause grief and regret after a few months. A competitor may step in with a new product or put on a blast of advertising. Economic conditions change and upset predictions and plans. These are some of the reasons why information on an analytic problem can never be complete, and why computations by use of a loss-function can only be conditional. The gap beyond statistical inference can be filled in only by knowledge of the subject-matter (economics, medicine, chemistry, engineering, psychology, agricultural science, etc.)...” (W. Edwards Deming (1975) “On Probability as a Basis for Action”, The American Statistician, 29:4, 146-152, https://www.tandfonline.com/doi/abs/10.1080/00031305.1975.10477402, p. 148)