Generalised additive models (GAMs): an introduction – Environmental Computing:
Let’s start with a famous tweet by one Gavin Simpson, which amounts to:
- GAMs are just GLMs
- GAMs fit wiggly terms
- use + s(x), not x, in your syntax
- use method = "REML"
- always look at gam.check()
And some notes on using GAMs to model time series data
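The "GAMs are just GLMs that fit wiggly terms" point can be made concrete with a few lines of code. This is an illustrative sketch of the idea behind a smooth term, not mgcv itself: a penalized regression spline, i.e. ridge-penalized least squares on a hand-built truncated-power basis. The knot placement and the penalty weight `lam` are arbitrary choices for the example (mgcv would pick the smoothness for you, e.g. by REML).

```python
# A "wiggly term" by hand: penalized least squares on a truncated-power
# cubic spline basis. Illustrates the idea behind a GAM smooth; the knots
# and penalty weight `lam` are arbitrary choices, not mgcv's defaults.
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 200)

knots = np.linspace(0.05, 0.95, 10)
# Design matrix: intercept, linear term, then one truncated cubic per knot.
X = np.column_stack([np.ones_like(x), x] +
                    [np.clip(x - k, 0, None) ** 3 for k in knots])

lam = 1e-4                     # arbitrary smoothing penalty for the sketch
D = np.eye(X.shape[1])
D[:2, :2] = 0.0                # do not penalise the intercept/linear part
beta = np.linalg.solve(X.T @ X + lam * D, X.T @ y)
fit = X @ beta

print("residual SD:", np.std(y - fit))   # close to the noise SD of 0.3
```

With the penalty set to zero this is an ordinary (unpenalized) GLM fit on the spline basis; the penalty is what trades wiggliness against fit.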
50 Years of Data Science: Journal of Computational and Graphical Statistics: Vol 26, No 4:
This paper is a great find, not least because the argument (statistics versus data science) was already in full swing 50 years ago.
I have no problem with predictive modelling, but it is a different task. And it does seem that the emphasis on ML has, in a pendulum swing from the days of Tukey, obscured the importance of understanding the generative model. From Donoho …
Predictive modeling is effectively silent about the underlying mechanism generating the data, and allows for many different predictive algorithms, preferring to discuss only accuracy of prediction made by different algorithms on various datasets. The relatively recent discipline of machine learning, often sitting within computer science departments, is identified by Breiman as the epicenter of the predictive modeling culture.
I like to think that our lab is pulling hard at the pendulum. That we care massively about the underlying mechanism. That for me is the ‘science’ in ‘data science’. Science because when right it tells us something about how the world works, not just how it will be. The difference between a super accurate weather forecast, and understanding the principles of the atmosphere and the climate. None of that devalues predictive modelling, but these are separate activities.
Abandoning statistical significance is both sensible and practical « Statistical Modeling, Causal Inference, and Social Science:
The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.
This eloquent exposition of why clinicians are necessary to data science feels like a manifesto for the lab.
We argue that a failure to adequately describe the role of subject-matter expert knowledge in data analysis is a source of widespread misunderstandings about data science. Specifically, causal analyses typically require not only good data and algorithms, but also domain expert knowledge.
And a general critique of ML as a method to improve health
A goal of data science is to help make better decisions. For example, in health settings, the goal is to help decision-makers—patients, clinicians, policy-makers, public health officers, regulators—decide among several possible strategies. Frequently, the ability of data science to improve decision making is predicated on the basis of its success at prediction. However, the premise that predictive algorithms will lead to better decisions is questionable.
And why the human orrery is a dangerous myth
the distinction between prediction and causal inference (counterfactual prediction) becomes unnecessary for decision making when the relevant expert knowledge can be readily encoded and incorporated into the algorithms. For example, a purely predictive algorithm that learns to play Go can perfectly predict the counterfactual state of the game under different moves, and a predictive algorithm that learns to drive a car can accurately predict the counterfactual state of the car if, say, the brakes are not operated. Because these systems are governed by a set of known game rules (in the case of games like Go) or physical laws with some stochastic components (in the case of engineering applications like self-driving cars) …
Or more specifically …
…contrary to some computer scientists’ belief, “causal inference” and “reinforcement learning” are not synonyms. Reinforcement learning is a technique that, in some simple settings, leads to sound causal inference. However, reinforcement learning is insufficient for causal inference in complex causal settings
A follow-on from the Charlson score function posted previously. Here are functions to calculate the SOFA score.
- it’s almost inconceivable that your data will be structured like mine, so you are unlikely to be able to use these ‘as is’; however, they might provide a useful skeleton.
- there are some add-ons included (e.g. if a blood gas is not available, you can still generate the SOFA respiratory score from oxygen saturations and the S:F ratio via this (slightly flawed) proposal)
- there are some arbitrary decisions too (e.g. any vasopressin use assigns 4 SOFA points for cardiovascular dysfunction)
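To show the shape of such a function, here is a sketch of the respiratory sub-score in Python (the original functions are in R). The P:F thresholds are the standard SOFA cut-offs; the S:F fallback is an assumption for the example, inverting the Rice et al. linear approximation S/F ≈ 64 + 0.84 × P/F, which is one way of reading the "slightly flawed" imputation proposal mentioned above.

```python
# Sketch of a SOFA respiratory sub-score with an (assumed) S:F fallback.
# P:F thresholds are the standard SOFA cut-offs; the S:F-to-P:F conversion
# inverts the Rice et al. approximation S/F = 64 + 0.84 * P/F, used here
# only to illustrate imputing a P:F when no blood gas is available.

def sofa_resp(pf=None, sf=None, ventilated=False):
    """Return the SOFA respiratory score (0-4) from a P:F ratio,
    imputing P:F from an S:F ratio if no blood gas is available."""
    if pf is None:
        if sf is None:
            raise ValueError("need either a P:F or an S:F ratio")
        pf = (sf - 64) / 0.84      # invert S/F = 64 + 0.84 * P/F
    # Scores of 3-4 additionally require respiratory support.
    if pf < 100 and ventilated:
        return 4
    if pf < 200 and ventilated:
        return 3
    if pf < 300:
        return 2
    if pf < 400:
        return 1
    return 0
```

For example, `sofa_resp(pf=150, ventilated=True)` returns 3, while the same ratio without ventilation caps at 2.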
Stopping rules and regression to the mean — Statistics Done Wrong:
if the trial is only half complete but there’s already a statistically significant difference in symptoms with the new medication, the researchers may terminate the study, rather than gathering more data to reinforce the conclusion.
When poorly done, however, this can lead to numerous false positives.
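A quick simulation makes the point. Under the null (two groups with identical means), peeking at the data every 10 observations and stopping at the first p < 0.05 pushes the false-positive rate well above the nominal 5%. This sketch uses a two-sample z-test with known unit variance and an arbitrary peek schedule:

```python
# Optional stopping under the null: both groups are drawn from the same
# distribution, and a two-sample z-test (known SD = 1) is run at several
# interim looks. Stopping at the first p < 0.05 inflates the nominal 5%
# false-positive rate to roughly 15-20%.
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(42)

def pvalue(a, b):
    """Two-sided two-sample z-test with known unit variance."""
    n = len(a)
    z = (a.mean() - b.mean()) / sqrt(2 / n)
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

n_sims, peeks = 1000, range(10, 101, 10)
false_positives = 0
for _ in range(n_sims):
    a, b = rng.normal(size=100), rng.normal(size=100)  # the null is true
    if any(pvalue(a[:n], b[:n]) < 0.05 for n in peeks):
        false_positives += 1

rate = false_positives / n_sims
print(f"false-positive rate with peeking: {rate:.2f}")  # well above 0.05
```

Proper group-sequential designs avoid this by spending the 5% error budget across the interim looks rather than testing at 5% every time.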
I have been using the bootstrap more often recently, but the data that I use is typically structured with patients nested in hospitals. The wonderful Cross Validated recommends that any sampling that is to be done should respect the structure of the data.
This means first sampling (with replacement) hospitals, and then sampling (with replacement again) within each hospital before re-assembling the data.
There is a better explanation along with a code snippet from the biostats department at Vanderbilt. However, with 48 hospitals and 15,000 patients, this ran very slowly.
I have re-written this using data.table, with a good (great?) improvement in speed (but some loss of flexibility).
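For reference, the two-stage (cluster) bootstrap itself is only a few lines. This is a language-agnostic sketch in Python/NumPy rather than the data.table version described above; the column names (`hospital`, `y`) and the statistic (the mean of `y`) are made up for the example.

```python
# Two-stage cluster bootstrap: resample hospitals with replacement, then
# resample patients with replacement within each sampled hospital.
# Illustrative Python/NumPy sketch (the post itself uses R / data.table);
# the variable names and the statistic are invented for the example.
import numpy as np

rng = np.random.default_rng(0)

def cluster_bootstrap(hospital, y, n_boot=200):
    """Bootstrap the mean of y, respecting patients nested in hospitals."""
    ids = np.unique(hospital)
    # Pre-compute row indices per hospital so each replicate is cheap.
    rows = {h: np.flatnonzero(hospital == h) for h in ids}
    stats = np.empty(n_boot)
    for b in range(n_boot):
        sampled = rng.choice(ids, size=len(ids), replace=True)   # stage 1
        idx = np.concatenate([rng.choice(rows[h], size=len(rows[h]),
                                         replace=True)           # stage 2
                              for h in sampled])
        stats[b] = y[idx].mean()
    return stats

# Toy data: 5 hospitals, 100 patients each, with hospital-level shifts.
hospital = np.repeat(np.arange(5), 100)
y = rng.normal(loc=hospital * 0.1, scale=1.0)
boot = cluster_bootstrap(hospital, y)
print("bootstrap SE of the mean:", boot.std())
```

Pre-computing the per-hospital row indices is also where most of the speed-up comes from in the data.table version: the expensive subsetting is done once, not once per replicate.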