Developments in Data Science

David Donoho, on the occasion of the John Tukey Centennial workshop last September, summarized “50 Years of Data Science”.

It’s an article well worth reading for anyone interested in future developments in data science and the practice of statistics: http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

After reviewing recent developments (and hype) in the intertwined discussion of Big Data and Data Science, Donoho looks back 50 years at seminal contributions by Tukey, Cleveland, Chambers, and Breiman.

Along the way, Donoho formulates a description of Greater Data Science (GDS), in contrast to a narrower version (Lesser Data Science) driven by commercial developments and interests currently in the news.

Donoho starts his history with a discussion of Tukey’s 1962 article, “The Future of Data Analysis,” in The Annals of Mathematical Statistics (https://projecteuclid.org/download/pdf_1/euclid.aoms/1177704711). Tukey defined Data Analysis and described how it should be seen as a science, not as a branch of mathematics.

Donoho then jumps ahead three decades to cite 1993 insights from John Chambers, developer of the S statistical language (ancestor to today’s R). In the article “Greater or Lesser Statistics: A Choice for Future Research” (Statistics and Computing, 3:4, pp. 182-184, https://statweb.stanford.edu/~jmc4/papers/greater.ps), Chambers argues for a broader aim of statistical research, “based on an inclusive concept of learning from data” (quoting the abstract).

Donoho next discusses two articles from 2001.

The first, William Cleveland’s “Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics” (International Statistical Review, 69, 21-26, https://utexas.instructure.com/files/35465950/download), proposed that heavily mathematical statistical theory should usefully comprise only 20% of an academic preparation in Data Science, a far cry from the structure of any university statistics department over the past 50 years.

The second 2001 article, Leo Breiman’s “Statistical Modeling: The Two Cultures” (Statistical Science, 16:3, 199-231, http://projecteuclid.org/euclid.ss/1009213726), compared two primary goals in working with data. Breiman contrasted a focus on information (inference and an insistence on specific mathematical models) with a focus on prediction, and claimed that only 2% of academic statisticians spent their time and energy on problems of prediction.

Despite the impressive credentials and insights of Donoho’s quartet of statisticians, the academic field of statistics has not moved very far or fast in the direction they’ve outlined. Academic statistics risks being eclipsed by Lesser Data Science, the Data Science in the popular press and in university Deans’ minds that has less integrity and potential impact than Donoho’s GDS alternative.

The Deming Connection

Donoho’s historical review and Data Science recommendations reminded me of W. Edwards Deming’s 1975 article “On Probability as a Basis for Action” (The American Statistician, 29:4, 146-152, https://www.deming.org/media/pdf/145.pdf). Deming also failed to move academic statistics departments very far from their traditional focus.

Deming distinguished between enumerative studies and analytic studies, roughly the same dichotomy flagged by Breiman; Deming cited work by industrial statisticians in the 1940s as progenitors of the distinction.

Enumerative studies are an arena for mathematical modeling of a fixed population, operationalized by a sampling frame. Analytic studies, in contrast, are focused on future performance, the problem of prediction.

Despite advances in machine learning and access to ever-larger data sets, Deming’s distinction remains important: unless the mechanisms (set of causes) of the system under study remain essentially the same in the future, “best” predictive models and compelling insights from a batch of data (large or small) can fail badly as a guide for decisions and actions tomorrow. Shewhart built his control chart theory and tools to assess the evidence that a set of causes is “about the same” over time, which suggests a role for control chart thinking in any new Data Science.
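
To make that control-chart thinking concrete, here is a minimal sketch (in Python, with simulated data and assumed parameters, not anything from Shewhart’s, Deming’s, or Donoho’s texts) of an individuals chart: limits are estimated from a stable baseline via the average moving range, and later points falling outside those limits are evidence that the set of causes has changed.

    import numpy as np

    def individuals_chart_limits(x):
        """Shewhart individuals (X) chart: center line and 3-sigma limits
        estimated from the average moving range (sigma ~= MRbar / 1.128)."""
        x = np.asarray(x, dtype=float)
        center = x.mean()
        mr_bar = np.mean(np.abs(np.diff(x)))   # average moving range
        sigma_hat = mr_bar / 1.128             # d2 constant for subgroups of size 2
        return center, center - 3 * sigma_hat, center + 3 * sigma_hat

    # Simulated process: stable for 30 periods, then the "set of causes" shifts.
    rng = np.random.default_rng(0)
    stable = rng.normal(10.0, 1.0, 30)
    shifted = rng.normal(13.0, 1.0, 10)
    series = np.concatenate([stable, shifted])

    center, lcl, ucl = individuals_chart_limits(stable)   # limits from the stable baseline
    out_of_control = np.where((series < lcl) | (series > ucl))[0]
    print(f"center={center:.2f}  limits=({lcl:.2f}, {ucl:.2f})")
    print("points signaling a change in the system:", out_of_control)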

Deming noted:

“It is important to remember that the mean, the variance, the standard error, likelihood, and many other functions of a set of numbers, are symmetric. Interchange of any two observations x_i and x_j leaves unchanged the mean, the variance, and even the distribution itself. Obviously, then, use of variance and elaborate methods of estimation buries the information contained in the order of appearance in the original data, and must therefore be presumed inefficient until cleared.” (p. 149)
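
A small sketch makes Deming’s point tangible (Python, simulated data, purely illustrative): permuting a time-ordered series leaves the mean, variance, and empirical distribution unchanged, while an order-dependent statistic such as the lag-1 autocorrelation loses the signal entirely.

    import numpy as np

    rng = np.random.default_rng(1)
    # A drifting process: the order of appearance carries real information.
    t = np.arange(100)
    x = 0.05 * t + rng.normal(0.0, 1.0, 100)

    shuffled = rng.permutation(x)   # interchange observations at will

    def lag1_autocorr(v):
        return np.corrcoef(v[:-1], v[1:])[0, 1]

    # Symmetric statistics cannot tell the two series apart...
    print(np.isclose(x.mean(), shuffled.mean()), np.isclose(x.var(), shuffled.var()))
    # ...but order-dependent statistics can.
    print(f"lag-1 autocorrelation: original {lag1_autocorr(x):.2f}, "
          f"shuffled {lag1_autocorr(shuffled):.2f}")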

Donoho cites an example from the Tukey Centennial that illustrates the issue in a completely modern setting: “Rafael Irizarry gave a convincing example of exploratory data analysis of GWAS [Genome Wide Association Study] data, studying how the data row mean varied with the date on which each row was collected, convinc[ing] the field of gene expression analysis to face up to some data problems that were crippling their studies” (footnote 29, p. 23).
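
A rough sketch of the same kind of check (Python; the expression matrix, dates, and batch shift below are simulated for illustration, not Irizarry’s actual data or code): summarizing each sample’s row mean by collection date makes a processing shift visible before any downstream modeling.

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(2)
    # Simulated expression matrix: 60 samples (rows) x 500 features (columns),
    # with a processing shift affecting everything collected after a cutoff date.
    dates = pd.date_range("2015-01-05", periods=60, freq="D")
    batch_shift = np.where(dates > "2015-02-10", 0.8, 0.0)
    expr = rng.normal(0.0, 1.0, (60, 500)) + batch_shift[:, None]

    summary = (pd.DataFrame({"date": dates, "row_mean": expr.mean(axis=1)})
                 .groupby(pd.Grouper(key="date", freq="W"))["row_mean"]
                 .mean())
    print(summary)   # weekly averages of the per-sample means reveal the jump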

Donoho’s quartet worked extensively on real data problems; Deming worked on national-scale data surveys in the mid-20th century at the U.S. Census and in his consulting practice.

Perhaps the ideas of Greater Data Science sketched by Donoho naturally arise when very smart, skilled people interact with challenging data problems!

More on Randomized Control Trials: The Views of Philosopher Nancy Cartwright

Limits to Naïve Application of Fisher’s Advice in Social Science