Folder templates for data science projects

There are arguably three pieces of computing jargon and architecture
that we need to teach to avoid a lot of pain later: paths, folders (or
directories), and environments. None of these is interesting in and of
itself, but a small investment now will pay dividends. This post
talks about how to organise your files and folders. Unlike macOS,
iPads, and iPhones, we are keen that you see the nuts and bolts of how
these things work.

  1. Use relative paths (including for symlinks) so that your entire
    project structure is transportable. A relative symlink target is
    resolved from the directory that holds the link, so to link ./data
    into ./code (run from the project root):

        ln -s ../data ./code/data

    To load data within an R script:

        mydata <- readRDS('data/mydata.RDS')
  2. Separate inputs and outputs
    • specifically separate and protect the input data
  3. Write notes
  4. Define what needs to go under version control
  5. Think about using a build tool … make, doit, …

A template project structure.

mypaper
├── code
│   ├── _config
│   ├── a0utils
│   ├── a1prep
│   ├── a2analysis
│   ├── data -> ../data
│   ├── data-raw -> ../data-raw
│   ├── figures -> ../figures
│   ├── labbook
│   │   └──
│   ├── library
│   ├──
│   └── tables -> ../tables
├── data
├── data-raw
├── figures
├── filing
├──
├── tables
├── tmp
├── todo.todo
└── write
    ├──
    └──

Put just the ./code directory under version control. Never write to ./data-raw.

For an alternative approach, see Templates for reproducible research,
which goes much further and splits directories formally using a build
tool (waf).
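
As a rough sketch of how the template above could be scaffolded from the command line: the script below builds the directories in a throwaway location (the temporary `root` directory and the `moved` name are assumptions for illustration only), adds the relative symlinks inside ./code, and then moves the whole project to show that the links still resolve. It only covers the named directories, not the files.

```shell
#!/bin/sh
set -eu

# Hypothetical sandbox so the sketch never touches a real project.
root="$(mktemp -d)"
proj="$root/mypaper"

# Real directories at the top level of the project.
for d in code data data-raw figures filing tables tmp write; do
    mkdir -p "$proj/$d"
done

# Subdirectories of ./code, as listed in the template.
for d in _config a0utils a1prep a2analysis labbook library; do
    mkdir -p "$proj/code/$d"
done

# Relative symlinks: the target is resolved from the link's own directory,
# so "../data" seen from ./code points at ./data.
for d in data data-raw figures tables; do
    ln -s "../$d" "$proj/code/$d"
done

# Moving the whole project does not break the links.
mv "$proj" "$root/moved"
proj="$root/moved"
ls -l "$proj/code"
```

Because every link is relative, the same tree can be copied to another machine or another folder and the scripts in ./code will still find data/ and figures/.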


Sidenotes: Footnotes and Marginal Notes

One of the most distinctive features of Tufte’s style is his extensive use of sidenotes. Sidenotes are like footnotes, except they don’t force the reader to jump their eye to the bottom of the page, but instead display off to the side in the margin. Perhaps you have noticed their use in this document already. You are very astute.
This is a side note. Notice there is a number preceding the note.

Sidenotes are a great example of the web not being like print. On sufficiently large viewports, Tufte CSS uses the margin for sidenotes, margin notes, and small figures. On smaller viewports, elements that would go in the margin are hidden until the user toggles them into view. The goal is to present related but not necessary information such as asides or citations as close as possible to the text that references them. At the same time, this secondary information should stay out of the way of the eye, not interfering with the progression of ideas in the main text.

Sidenotes consist of two elements: a superscript reference number that goes inline with the text, and a sidenote with content. To add the former, just put a label and dummy checkbox into the text where you want the reference to go, like so:

You must manually assign a reference id to each side or margin note, replacing “sn-demo” in the for and the id attribute values with an appropriate descriptor. It is useful to use prefixes like sn- for sidenotes and mn- for margin notes.

Immediately adjacent to that sidenote reference in the main text goes the sidenote content itself, in a span with class sidenote. This tag is also inserted directly in the middle of the body text, but is either pushed into the margin or hidden by default. Make sure to position your sidenotes correctly by keeping the sidenote-number label close to the sidenote itself.

If you want a sidenote without footnote-style numberings, then you want a margin note. This is a margin note. Notice there isn’t a number preceding the note. On large screens, a margin note is just a sidenote that omits the reference number. This lessens the distracting effect taking away from the flow of the main text, but can increase the cognitive load of matching a margin note to its referent text. However, on small screens, a margin note is like a sidenote except its viewability-toggle is a symbol rather than a reference number. This document currently uses the symbol ⊕ (&#8853;), but it’s up to you.

Margin notes are created just like sidenotes, but with the marginnote class for the content and the margin-toggle class for the label and dummy checkbox. For instance, here is the code for the margin note used in the previous paragraph:

This is a margin note. Notice there isn’t a number preceding the note.

Figures in the margin are created as margin notes, as demonstrated in the next section.

The end of history?

A hundred years later the RCT may seem like the end of this history.[1] However, in critical care we are more aware than most that this would be a poor ending. Clinical trials have been notoriously fruitless in our field, and despite much promising pre-clinical work, this has been especially true in sepsis research.[Riedemann:2003] The main problem is that the delivery of a clinical trial is akin to measuring the meridian line. These are expensive juggernauts that can only ask one question at a time. Where the answer is subtle, the funds to power the trial machine will be exhausted before a small difference is detected.

There are new strategies that aim to make the clinical trial more agile.[2] However much we supercharge the randomised trial, it will never be able to keep pace with our need to understand the universe of clinical medicine. If big data is going to be the answer to this then it must show itself deserving of the trust that we place in an RCT. Google and friends are telling us that this will be machine learning and artificial intelligence. However, if the diet of machine learning is big data, then we are likely to be disappointed. Methods which learn from data do not alone produce theory. Mendelian inheritance, the structure of the double helix, and the general theory of relativity were not problems with data waiting for machine learning to solve. Yes, it is possible that we could feed IBM’s Watson the position of the stars as documented by the ancients. Watson would likely do a good job of recognising that certain spots of light, the planets,[3] did not move in the same way as others. But to expect that from this Watson would suggest gravity, the Copernican universe, and Newton’s laws of motion is magical thinking.

  1. The End of History and the Last Man is a 1992 book by Francis Fukuyama which argued that Western liberal democracy would be the final endpoint of social and political development. A quarter of a century later this claim seems rather premature.

  2. This includes both platform trials, and now REMAP (Randomized Embedded Multifactorial Adaptive Platform). Here new treatment options are continuously added and removed, as they are discovered and assessed, and the randomisation is embedded in health care delivery. The EHR can even provide the real-time data collection and feedback loop.[Angus:2015jw]

  3. From the Greek for wanderers, because their position relative to the other stars was not constant.

50 Years of Data Science …

50 Years of Data Science: Journal of Computational and Graphical Statistics: Vol 26, No 4:

This paper is a great find. Not least because the argument (statistics versus data science) was already in full swing 50 years ago.

I have no problem with predictive modelling, but it is a different task. And it does seem that the emphasis on ML has obscured (in a pendulum swing from the days of Tukey) the importance of understanding the generative model. From Donoho …

Predictive modeling is effectively silent about the underlying mechanism generating the data, and allows for many different predictive algorithms, preferring to discuss only accuracy of prediction made by different algorithm on various datasets. The relatively recent discipline of machine learning, often sitting within computer science departments, is identified by Breiman as the epicenter of the predictive modeling culture.

I like to think that our lab is pulling hard at the pendulum. That we care massively about the underlying mechanism. That for me is the ‘science’ in ‘data science’. Science because when right it tells us something about how the world works, not just how it will be. The difference between a super accurate weather forecast, and understanding the principles of the atmosphere and the climate. None of that devalues predictive modelling, but these are separate activities.

Abandoning statistical significance is both sensible and practical …

Abandoning statistical significance is both sensible and practical « Statistical Modeling, Causal Inference, and Social Science:

The replication crisis in science is not the product of the publication of unreliable findings. The publication of unreliable findings is unavoidable: as the saying goes, if we knew what we were doing, it would not be called research. Rather, the replication crisis has arisen because unreliable findings are presented as reliable.

A manifesto for our lab?

This eloquent exposition of why clinicians are necessary to data science feels like a manifesto for the lab.

We argue that a failure to adequately describe the role of subject-matter expert knowledge in data analysis is a source of widespread misunderstandings about data science. Specifically, causal analyses typically require not only good data and algorithms, but also domain expert knowledge.

And a general critique of ML as a method to improve health

A goal of data science is to help make better decisions. For example, in health settings, the goal is to help decision-makers—patients, clinicians, policy-makers, public health officers, regulators—decide among several possible strategies. Frequently, the ability of data science to improve decision making is predicated on the basis of its success at prediction. However, the premise that predictive algorithms will lead to better decisions is questionable.

And why the human orrery is a dangerous myth

the distinction between prediction and causal inference (counterfactual prediction) becomes unnecessary for decision making when the relevant expert knowledge can be readily encoded and incorporated into the algorithms. For example, a purely predictive algorithm that learns to play Go can perfectly predict the counterfactual state of the game under different moves, and a predictive algorithm that learns to drive a car can accurately predict the counterfactual state of the car if, say, the brakes are not operated. Because these systems are governed by a set of known game rules (in the case of games like Go) or physical laws with some stochastic components (in the case of engineering applications like self-driving cars),

Or more specifically …

…contrary to some computer scientists’ belief, “causal inference” and “reinforcement learning” are not synonyms. Reinforcement learning is a technique that, in some simple settings, leads to sound causal inference. However, reinforcement learning is insufficient for causal inference in complex causal settings

Git tips

I have been using Git now for a couple of years, and have struggled to understand what it does. I get the basic concept (that it keeps a record of the changes I make to my code) but it also sometimes seems to get in the way. I have read Think like (a) Git and The thing about git in the last couple of days and learned a few really useful things.

In no particular order …


  • the idea that git commits can be ‘wasted’ – you don’t need to keep everything or worry about a commit being perfect. Commit if you feel like it.
  • think of branches as save points (via Think like (a) git)

Crafting a commit (or not)

  • in contrast, you can also ‘craft’ a commit: this is the idea behind the staging area (or index), and is nicely covered in The Thing About Git. Here you can almost imagine writing your git commit message before you commit (i.e. I fixed problem X). Then simply add those files (or parts of files — aka ‘hunks’). Adding ‘hunks’ is the task of …
  • git add --patch – this is genius. When you are preparing a commit, you don’t need to commit an entire file. Running git add --patch allows you to run through the diffs in the file and stage only those parts you wish to. You can also think of this as a bit of a backwards solution to the classic git commit --amend, which allows you to add things you forgot to the previous commit.

Git patch options

y - stage this hunk
n - do not stage this hunk
a - stage this and all the remaining hunks in the file
d - do not stage this hunk nor any of the remaining hunks in the file
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see previous hunk
s - split the current hunk into smaller hunks

Selectively applying changes from one branch to another

Common scenario: work in one branch would be useful in another but you don’t want to merge the branches.

Suppose the current branch is this_branch and the branch with the changes you want to pull is called that_branch.

To pull across a specific commit:

git cherry-pick pulls across just a specific commit, but not necessarily a whole file.
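
A minimal sketch of the cherry-pick case, run in a throwaway repository. The branch names follow the ones used above; the file names, commit messages and author identity are invented for the example. that_branch gets two commits, and only the first is pulled across to this_branch.

```shell
#!/bin/sh
set -eu

# Throwaway repository (hypothetical files and messages).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Base commit"
git branch -M this_branch          # name the starting branch this_branch

# Work on that_branch: one commit we want, one we do not.
git checkout -q -b that_branch
echo "useful" > useful.txt
git add useful.txt
git commit -q -m "Useful change"
wanted="$(git rev-parse HEAD)"     # remember the commit we want
echo "unfinished" > other.txt
git add other.txt
git commit -q -m "Unfinished work"

# Back on this_branch, pull across only the wanted commit.
git checkout -q this_branch
git cherry-pick "$wanted" > /dev/null

git log --oneline
```

The cherry-picked commit is a new commit on this_branch with the same changes and message; the unfinished work stays on that_branch.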

To pull across specific file(s):

git checkout that_branch path/to/myfile1 path/to/myfile2

An interactive patch from your current branch.

git checkout this_branch
git checkout -p that_branch


via Jason Rudolph
SO answer

Using Git for writing

  • git diff --word-diff=color … wow! A way to inspect word-by-word changes in the file. Much better suited to using Git when writing. I should write more about this whole topic! In the meantime, also note
    • type -S while viewing the diff to wrap long lines and make everything more readable (via someone45 at StackOverflow … Thanks!).
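
A quick sketch of what a word diff looks like on a sentence of prose. The repository, file name and sentence are made up, and the example uses --word-diff=plain rather than =color purely so that the markers are visible as text: removed words appear as [-…-] and added words as {+…+}.

```shell
#!/bin/sh
set -eu

# Throwaway repository with one line of prose (hypothetical content).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

printf 'The quick brown fox jumps over the lazy dog.\n' > draft.txt
git add draft.txt
git commit -q -m "First draft"

# Change a single word in the sentence.
printf 'The slow brown fox jumps over the lazy dog.\n' > draft.txt

# Word-level diff: only the changed word is marked, not the whole line.
wdiff="$(git diff --word-diff=plain draft.txt)"
echo "$wdiff"
```

With an ordinary line diff the whole sentence would show as removed and re-added, which is why the word diff suits long paragraphs of writing.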

Visualising your commits

  • a free git visualization tool called GitX
  • the command line version of the above is git log --graph --decorate --oneline

Finding and restoring something you deleted

git log --diff-filter=D --summary

then restore it

git checkout <deleting_commit>^ -- <file_path>

(via kablamo and Charles Bailey on stackoverflow)
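
Here is the same find-and-restore recipe as a runnable sketch in a throwaway repository. The file name, its contents and the commit messages are invented; the only change from the commands above is that the log is limited to one path and asked for just the commit hash, so that it can be stored in a variable.

```shell
#!/bin/sh
set -eu

# Throwaway repository (hypothetical file and messages).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "precious analysis" > analysis.R
git add analysis.R
git commit -q -m "Add analysis"

# Delete the file and commit the deletion.
git rm -q analysis.R
git commit -q -m "Remove analysis"

# Find the commit that deleted the file ...
deleting_commit="$(git log --diff-filter=D --format=%H -1 -- analysis.R)"

# ... and restore the file from that commit's parent (note the ^).
git checkout -q "$deleting_commit^" -- analysis.R

cat analysis.R
```

The ^ matters: the deleting commit itself no longer contains the file, so the checkout has to come from the commit just before it.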

Things I still need to get my head around

  • cherry picking
  • the rebase command


Early notes: imagine a development branch, and a feature branch. While you’re working on the new feature, you also make (possibly) separate changes to the development branch. You don’t want to destroy the feature (it’s not done yet), but you want these recent changes on the development branch in your feature. Then rebase. ‘Rebasing’ refers to moving the point where your feature branch separated from the development branch forward in time. In fact, you move it forward to ‘capture’ as many of the recent changes on development as you like.

And if it turns out that there are conflicts between your feature branch and the development branch then you can either manually resolve them or do an ‘interactive’ rebase.
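
To make the description concrete, here is a small conflict-free sketch in a throwaway repository. The branch names development and feature come from the notes above; the files, messages and author identity are made up. The feature branch starts from development, development then gains a fix, and the rebase moves the feature's starting point forward to include it.

```shell
#!/bin/sh
set -eu

# Throwaway repository (hypothetical files and messages).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example Author"
git config user.email "author@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "Base commit"
git branch -M development

# Start the feature from the current tip of development.
git checkout -q -b feature
echo "feature" > feature.txt
git add feature.txt
git commit -q -m "Start feature"

# Meanwhile, development moves on.
git checkout -q development
echo "fix" > fix.txt
git add fix.txt
git commit -q -m "Fix on development"

# Rebase: replay the feature commits on top of the new development tip.
git checkout -q feature
git rebase -q development > /dev/null

git log --oneline --graph
```

After the rebase the feature branch contains the fix as well as the unfinished feature, and the point where the two branches diverge is now the latest commit on development.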

A re-read of Think like (a) Git is in order!

Calculate the Charlson Score in R

It is unlikely that anyone else will have data formatted in the same way as me, but it shouldn’t be too hard to convert this.

You just need a data frame where the rows represent patients, and there are a series of columns containing indicator variables for the components of the Charlson score[1].

The following function looks for these 20 variables, and uses a quick bit of matrix multiplication to generate the total score.

For reference, here is the table from the original paper containing the weights for the score.

Charlson Score

[1]: Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. Journal of Chronic Diseases. 1987;40:373-383.

Syntax highlighting for R in the terminal

RStudio seemed to be running really slowly today, which prompted me to try using R in the terminal. This works nicely with R-Box. In other words: type in Sublime Text, and execute in R (via iTerm).

This all worked much more quickly, and the plots show up in a lovely quartz window. I lose a lot of the easy point-and-click functionality, but I never used the text editor so I don’t miss that.

What I did miss was syntax highlighting. The solution (via StackOverflow as usual) is a super cool little package called colorout.

Before (top half) and after (bottom half) of my screen. Which looks nicer?

141128 iTerm colorout screenshot

Don’t forget to load the package in your
