Tag: statistics

50 Years of Data Science …

50 Years of Data Science: Journal of Computational and Graphical Statistics: Vol 26, No 4:

This paper is a great find, not least because the argument (statistics versus data science) was already in full swing 50 years ago.

I have no problem with predictive modelling, but it is a different task. And it does seem that the emphasis on ML has obscured (in a pendulum swing from the days of Tukey) the importance of understanding the generative model. From Donoho …

Predictive modeling is effectively silent about the underlying mechanism generating the data, and allows for many different predictive algorithms, preferring to discuss only accuracy of prediction made by different algorithm on various datasets. The relatively recent discipline of machine learning, often sitting within computer science departments, is identified by Breiman as the epicenter of the predictive modeling culture.

I like to think that our lab is pulling hard at the pendulum. That we care massively about the underlying mechanism. That for me is the ‘science’ in ‘data science’. Science because when right it tells us something about how the world works, not just how it will be. The difference between a super accurate weather forecast, and understanding the principles of the atmosphere and the climate. None of that devalues predictive modelling, but these are separate activities.

Abandoning statistical significance is both sensible and practical …

Abandoning statistical significance is both sensible and practical « Statistical Modeling, Causal Inference, and Social Science:

The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.

A manifesto for our lab?

This eloquent exposition of why clinicians are necessary to data science feels like a manifesto for the lab.

We argue that a failure to adequately describe the role of subject-matter expert knowledge in data analysis is a source of widespread misunderstandings about data science. Specifically, causal analyses typically require not only good data and algorithms, but also domain expert knowledge.

And a general critique of ML as a method to improve health

A goal of data science is to help make better decisions. For example, in health settings, the goal is to help decision-makers—patients, clinicians, policy-makers, public health officers, regulators—decide among several possible strategies. Frequently, the ability of data science to improve decision making is predicated on the basis of its success at prediction. However, the premise that predictive algorithms will lead to better decisions is questionable.

And why the human orrery is a dangerous myth

the distinction between prediction and causal inference (counterfactual prediction) becomes unnecessary for decision making when the relevant expert knowledge can be readily encoded and incorporated into the algorithms. For example, a purely predictive algorithm that learns to play Go can perfectly predict the counterfactual state of the game under different moves, and a predictive algorithm that learns to drive a car can accurately predict the counterfactual state of the car if, say, the brakes are not operated. Because these systems are governed by a set of known game rules (in the case of games like Go) or physical laws with some stochastic components (in the case of engineering applications like self-driving cars),

Or more specifically …

…contrary to some computer scientists’ belief, “causal inference” and “reinforcement learning” are not synonyms. Reinforcement learning is a technique that, in some simple settings, leads to sound causal inference. However, reinforcement learning is insufficient for causal inference in complex causal settings

Calculate the SOFA score in R

A follow-on from the Charlson score function previously posted. Here are functions to calculate the SOFA score.

Please note

  • it’s almost inconceivable that your data will be similar to mine, and you will be able to just use these ‘as is’; however, they might provide a useful skeleton.
  • there are some add-ons included (e.g. if a blood gas is not available then you can still generate the SOFA respiratory score using oxygen saturations and the S:F ratio via this (slightly flawed) proposal)
  • there are some arbitrary decisions too (e.g. any vasopressin use is scored as 4 SOFA points for cardiovascular dysfunction)
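To make the vasopressin decision concrete, here is a minimal sketch of the cardiovascular component only. It is written in Python for illustration and is not the R code from this post. The function name and arguments are hypothetical, the thresholds are the commonly published SOFA cardiovascular criteria (doses in µg/kg/min) and should be checked against the original definition before use, and the vasopressin rule is the arbitrary choice described above.

```python
def sofa_cardiovascular(map_mmhg=None, dopamine=0.0, dobutamine=0.0,
                        epinephrine=0.0, norepinephrine=0.0,
                        vasopressin=False):
    """Return the SOFA cardiovascular score (0-4).

    Hypothetical helper. Drug doses are assumed to be in ug/kg/min.
    A missing MAP (None) is treated as 'not low', which is itself an assumption.
    """
    # Arbitrary decision from the post: any vasopressin use scores 4.
    if vasopressin or dopamine > 15 or epinephrine > 0.1 or norepinephrine > 0.1:
        return 4
    # Dopamine > 5, or any lower-dose epinephrine / norepinephrine.
    if dopamine > 5 or epinephrine > 0 or norepinephrine > 0:
        return 3
    # Low-dose dopamine, or dobutamine at any dose.
    if dopamine > 0 or dobutamine > 0:
        return 2
    # No vasoactive drugs: score on mean arterial pressure alone.
    if map_mmhg is not None and map_mmhg < 70:
        return 1
    return 0
```

Checking from the highest score downwards means a patient on several drugs gets the worst applicable score, which is how the component is usually read.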

Bootstrap at the cluster or the unit level

I have been using the bootstrap more often recently, but the data that I use is typically structured with patients nested in hospitals. The wonderful Cross Validated recommends that any sampling that is to be done should respect the structure of the data.

This means first sampling (with replacement) hospitals, and then sampling (with replacement again) within each hospital before re-assembling the data.
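The two-stage resampling just described can be sketched as follows. This is a plain Python illustration of the idea, not the R code referred to in this post. The function name, the `boot_cluster` column and the toy data are all made up for the example.

```python
import random
from collections import defaultdict


def cluster_bootstrap(rows, cluster_key, rng):
    """Draw one two-stage bootstrap resample from rows (a list of dicts)."""
    # Group the rows by cluster (e.g. hospital).
    by_cluster = defaultdict(list)
    for row in rows:
        by_cluster[row[cluster_key]].append(row)
    clusters = list(by_cluster)

    resample = []
    # Stage 1: sample clusters with replacement.
    for draw, cluster in enumerate(rng.choices(clusters, k=len(clusters))):
        members = by_cluster[cluster]
        # Stage 2: sample units with replacement within the drawn cluster.
        for row in rng.choices(members, k=len(members)):
            # Relabel so a cluster drawn twice is treated as two clusters.
            resample.append({**row, "boot_cluster": draw})
    return resample


# Toy data, invented for illustration: patients nested in three hospitals.
toy_rows = (
    [{"hospital": "A", "y": v} for v in (1.0, 2.0, 3.0)]
    + [{"hospital": "B", "y": v} for v in (4.0, 5.0)]
    + [{"hospital": "C", "y": v} for v in (6.0, 7.0, 8.0, 9.0)]
)

rng = random.Random(2024)
# Bootstrap distribution of the overall mean over 200 resamples.
boot_means = []
for _ in range(200):
    sample = cluster_bootstrap(toy_rows, "hospital", rng)
    boot_means.append(sum(r["y"] for r in sample) / len(sample))
```

Note that the size of each resample varies, because the drawn hospitals differ in size. The relabelling matters if the resample is then fed to a model that uses the cluster identifier.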

There is a better explanation along with a code snippet from the biostats department at Vanderbilt. However, with 48 hospitals and 15,000 patients, this ran very slowly.

I have re-written this using data.table, with a good (great?) improvement in speed (but some loss of flexibility).