50 Years of Data Science: Journal of Computational and Graphical Statistics: Vol 26, No 4:
This paper is a great find, not least because the argument (statistics versus data science) was already in full swing 50 years ago.
I have no problem with predictive modelling, but it is a different task. And it does seem that the emphasis on ML has obscured (in a pendulum swing from the days of Tukey) the importance of understanding the generative model. From Donoho …
Predictive modeling is effectively silent about the underlying mechanism generating the data, and allows for many different predictive algorithms, preferring to discuss only accuracy of prediction made by different algorithm on various datasets. The relatively recent discipline of machine learning, often sitting within computer science departments, is identified by Breiman as the epicenter of the predictive modeling culture.
I like to think that our lab is pulling hard at the pendulum, and that we care massively about the underlying mechanism. That, for me, is the ‘science’ in ‘data science’: science because, when right, it tells us something about how the world works, not just how it will be. It is the difference between a super accurate weather forecast and understanding the principles of the atmosphere and the climate. None of that devalues predictive modelling, but these are separate activities.
Abandoning statistical significance is both sensible and practical « Statistical Modeling, Causal Inference, and Social Science:
The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.
This eloquent exposition of why clinicians are necessary to data science feels like a manifesto for the lab.
We argue that a failure to adequately describe the role of subject-matter expert knowledge in data analysis is a source of widespread misunderstandings about data science. Specifically, causal analyses typically require not only good data and algorithms, but also domain expert knowledge.
And a general critique of ML as a method to improve health:
A goal of data science is to help make better decisions. For example, in health settings, the goal is to help decision-makers—patients, clinicians, policy-makers, public health officers, regulators—decide among several possible strategies. Frequently, the ability of data science to improve decision making is predicated on the basis of its success at prediction. However, the premise that predictive algorithms will lead to better decisions is questionable.
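A toy illustration of that last point (my own sketch, not from the paper): simulate patients where sicker people are more likely to be treated, and the treatment genuinely lowers the risk of death. Every number and name below (`simulate`, `death_rate`, the risk values) is made up for the example. The raw association, which is what a purely predictive model would pick up, says treated patients die more often. Adjusting for severity, which only works because we know from the setup that severity drives both treatment and outcome, recovers the benefit.

```python
import random


def simulate(n=100_000, seed=0):
    """Hypothetical cohort: severity drives both treatment and death."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        severe = rng.random() < 0.5
        # Sicker patients are treated far more often (confounding by indication).
        treated = rng.random() < (0.8 if severe else 0.2)
        # Assumed truth: treatment lowers the risk of death by 0.10 for everyone.
        risk = (0.60 if severe else 0.15) - (0.10 if treated else 0.0)
        died = rng.random() < risk
        rows.append((severe, treated, died))
    return rows


def death_rate(rows, treated, severe=None):
    """Observed death rate among treated/untreated, optionally within a severity stratum."""
    deaths = [d for s, t, d in rows if t == treated and (severe is None or s == severe)]
    return sum(deaths) / len(deaths)


rows = simulate()

# What the raw data "predict": treated patients die more often.
naive_diff = death_rate(rows, True) - death_rate(rows, False)

# Standardise over severity: compare like with like, then weight each
# stratum by how common it is in the whole cohort.
share_severe = sum(s for s, _, _ in rows) / len(rows)
adjusted_diff = sum(
    weight * (death_rate(rows, True, s) - death_rate(rows, False, s))
    for s, weight in ((True, share_severe), (False, 1 - share_severe))
)

print(f"naive difference in death rate:    {naive_diff:+.3f}")
print(f"severity-adjusted difference:      {adjusted_diff:+.3f}")
```

The naive difference comes out positive and the adjusted one negative. Both are computed from the same data; the only thing that separates them is the knowledge that severity is a common cause, and that knowledge has to come from somewhere other than the data.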
And why the human orrery is a dangerous myth:
the distinction between prediction and causal inference (counterfactual prediction) becomes unnecessary for decision making when the relevant expert knowledge can be readily encoded and incorporated into the algorithms. For example, a purely predictive algorithm that learns to play Go can perfectly predict the counterfactual state of the game under different moves, and a predictive algorithm that learns to drive a car can accurately predict the counterfactual state of the car if, say, the brakes are not operated. Because these systems are governed by a set of known game rules (in the case of games like Go) or physical laws with some stochastic components (in the case of engineering applications like self-driving cars),
Or more specifically …
…contrary to some computer scientists’ belief, “causal inference” and “reinforcement learning” are not synonyms. Reinforcement learning is a technique that, in some simple settings, leads to sound causal inference. However, reinforcement learning is insufficient for causal inference in complex causal settings