Folder templates for data science projects

There are arguably three pieces of computing jargon and architecture
that we need to teach early to avoid a lot of pain later: paths, folders
(or directories), and environments. None of these is interesting in and
of itself, but a small investment now will pay dividends. This post
talks about how to organise your files and folders. macOS, iPads, and
iPhones tend to hide these details; we, by contrast, are keen that you
see the nuts and bolts of how these things work.

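  1. Use relative paths (including for symlinks) so that your entire
    project structure is transportable. For example, to link the shared
    data directory into ./code (a relative link target is resolved from
    the directory that holds the link, hence ../data):

        ln -s ../data ./code/data

    To load data within an R script (this works from either the project
    root or ./code, thanks to the symlink):

        mydata <- readRDS('data/mydata.RDS')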
  2. Separate inputs and outputs
    • specifically, separate and protect the input data
  3. Write notes (readme.md)
  4. Define what needs to go under version control
  5. Think about using a build tool (make, doit, …)
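
Points 1 and 2 can be tried out safely in a scratch directory. The following is a minimal sketch rather than a prescribed workflow: the names demo, moved, input.csv, and derived.csv are made up for illustration, and making data-raw read-only with chmod is just one possible way to "protect" the inputs.

```shell
#!/bin/sh
# Sketch: relative symlinks survive a move; raw inputs are made read-only.
# All directory and file names here are hypothetical.
set -eu

work=$(mktemp -d)   # scratch area so nothing real is touched
cd "$work"

mkdir -p demo/code demo/data demo/data-raw
echo "id,value" > demo/data-raw/input.csv

# A relative link target is resolved from the directory holding the link,
# so from code/ the sibling data/ directory is ../data.
ln -s ../data demo/code/data

# Protect the inputs: remove write permission from data-raw and its contents.
chmod -R a-w demo/data-raw

# Outputs go to data/, reached here through the symlink in code/.
cat demo/data-raw/input.csv > demo/code/data/derived.csv

# Move the whole project: the relative link still resolves.
mv demo moved
link_target=$(readlink moved/code/data)
echo "link target: $link_target"
ls moved/code/data
```

Had the link been created with an absolute target, it would still point at the old location after the move, which is the reason for preferring relative paths.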

A template project structure.

    mypaper
    ├── code
    │   ├── _config
    │   ├── a0utils
    │   ├── a1prep
    │   ├── a2analysis
    │   ├── data -> ../data
    │   ├── data-raw -> ../data-raw
    │   ├── figures -> ../figures
    │   ├── labbook
    │   │   └── CCYY-MM-DD.py
    │   ├── library
    │   ├── readme.md
    │   └── tables -> ../tables
    ├── data
    ├── data-raw
    ├── figures
    ├── filing
    ├── readme.md
    ├── tables
    ├── tmp
    ├── todo.todo
    └── write
        ├── manuscript.md
        └── readme.md

Put just the ./code directory under version control. Never write to ./data-raw.
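
The layout above could be scaffolded with a short shell script. This is a sketch under a few assumptions: entries without a file extension (such as _config and a0utils) are taken to be directories, the script works in a temporary directory so it can be tried without touching real files, and the dated labbook file is left out because its name depends on the day.

```shell
#!/bin/sh
# Sketch: create the template project structure with relative symlinks.
set -eu

cd "$(mktemp -d)"   # scratch location; drop this line to build in place
root=mypaper        # project name used in the tree above

# Directories under code/ (assumed to be directories, not files).
mkdir -p "$root/code/_config" "$root/code/a0utils" "$root/code/a1prep" \
         "$root/code/a2analysis" "$root/code/labbook" "$root/code/library"

# Top-level directories for inputs, outputs, and writing.
mkdir -p "$root/data" "$root/data-raw" "$root/figures" "$root/filing" \
         "$root/tables" "$root/tmp" "$root/write"

# Relative symlinks from code/ back to the shared directories.
for d in data data-raw figures tables; do
    ln -s "../$d" "$root/code/$d"
done

# Empty placeholder files for notes, the to-do list, and the manuscript.
touch "$root/readme.md" "$root/code/readme.md" "$root/todo.todo" \
      "$root/write/manuscript.md" "$root/write/readme.md"

# Version control would then cover code/ only, for example by initialising
# a repository inside "$root/code" (not done here).
ls -l "$root/code"
```

Because the links inside code/ are relative, the resulting mypaper directory can be renamed or copied elsewhere without breaking them.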

For an alternative approach, see Templates for reproducible research
projects, which goes much further and splits directories formally using
a build tool (waf).

Links