“A foolish consistency is the hobgoblin of little minds”
— Ralph Waldo Emerson (and PEP 8!)
Numerical experiments executed on computers are a key part of research in subjects like Computer Science and Physics. I’m going to outline how I recommend researchers organise their data.
Be warned, this is a much simplified and opinionated post 🙂 targeted at people with only a small amount of experience in numerics or software development.
The Ingredients of a Good Numerics Bake
- A repository containing files needed to perform the experimentation.
- Code that takes input data and produces experimental results.
- Data that describes the parameters and results of the experiments.
Q. Where should we store everything?
Git has become central to Computer Science experimentation.
Why should you use git?
- It provides a single source of truth. There is no argument between collaborators: the copy in git is the definitive version.
- It avoids repeating yourself (DRY) across machines and collaborators.
- It provides a history of all your work. You can return to a previous script or version of your experiment.
- Because git is distributed, each machine has all the repository data on it, so you have several copies of all your data, an informal backup.
- There is one mode of transport: push and pull between machines. Keeps things simple.
“If it’s not in git, it doesn’t exist.”
I had a previous student who had not developed the habit of regularly checking everything into git. I warned the student that one day, when they were away, I would sneak down to their office, open their machine, take out the hard disks, and smash the platters with a hammer.
Months later, I received an email from the student thanking me for this threat of violence: their entire machine had been stolen from under their desk, but they were able to continue working a few hours later – because everything was in git.
If it’s not in git, it doesn’t exist.
“But I hate git!”
Tough. Don’t be a bad scientist.
“…and I pay your wages!”
Trickier. Your boss doesn’t like it.
“…and I grade your performance!”
The Professorial Workaround
The workaround is a weak manual link between professor and student: the student provides files from git to the professor (easy), then takes the professor's output and puts it back into git (not so easy).
“But what about Overleaf, Word, etc.?”
Yes, there are external silos where your data may get trapped. This makes me sad.
Either pay for features to sync to git (£12/month for Overleaf, for example), or employ the above professorial workaround.
What goes in the git repo?
Essentially, almost everything!
- A README file explaining the repo: where everything is, references to related publication(s), and how to build the code, run experiments, perform analysis, generate plots, etc.
- Code and scripts, including bash and other scripts used for running experiments.
- The LaTeX source of your paper.
- Related work: BibTeX data, but also PDF files of papers! This is very controversial for many nerds, but keeping everything in one place makes life much easier. One of my previous supervisors would start writing the related-work section just before a deadline, and having all those papers in one place made it so much easier.
- Jupyter notebooks. Yes, they don't version control well, but you shouldn't be doing much development in them anyway. Store them for reference if you hacked something together in a notebook during the exploratory phase of your experiments, or perhaps as a tutorial for others using your code/libraries.
- Plots and diagrams, presuming they're not going to change much; otherwise just store the scripts that generate them.
- Slides of presentations associated with the work (Beamer is a great format for this, if you can tolerate its uniformity).
- Notes on experimental design, meetings, and brainstorms. Markdown is a good choice of format for these.
Note that for public release, you may wish to use a separate repo (not everyone needs to read your meeting notes!) – just select a subset of this repo.
What does not go in the repo?
- Temporary files.
- Anything that will change and can be quickly regenerated, e.g. PDFs built from LaTeX.
- Large amounts of data.
You should add a .gitignore file from the start to avoid these being picked up, especially if collaborating with lots of people.
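A starting point might look like this (the exact patterns depend on your toolchain; the file names here are purely illustrative):

```
# Temporary and build files
*.tmp
*.aux
*.log
__pycache__/
*.pyc

# Quickly regenerated outputs, e.g. the PDF built from LaTeX
paper.pdf

# Large experimental data (synced elsewhere, not in git)
data/
```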
Our goal is to create a system that can be shared with and understood by others.
The most critical question to ask is:
Can a stranger take your repo, understand what’s going on, re-create your experiments, and use your code?
When building code:
- Make it work
- Make it right
- Make it fast (as much as is needed)
If your code isn’t fast enough, then use profiling to determine where the time is spent, and optimise only those parts that really need it.
Just tell me how to write better code!
(ok, this is a bit of a diversion, but these questions pop up at the same time people are asking me about experimentation)
Some programmers, especially outside of Computer Science, ask this question a lot. Beyond “seek formal training and spend a lot of time developing”, my quick tips are:
- Keep it Simple, Stupid – strive for simple solutions
- Decompose your code – into small, testable, functions and modules
- Limit the use of object orientation – it’s more likely you’ll over-use it than under-use it.
- Write tests! If you don’t already, this could well be the biggest improvement you can make.
- Read code written by other people (how many literary authors don't read books by other people?)
- Meaningful variable and function names – Mathematicians and Physicists are particularly guilty here: "well, E is obviously energy" … umm, but what if it stands for expectation? Experiment? Error? You are writing code for others to understand, not for the computer or for yourself.
How do I know if my code is correct?
You never do! You never know with certainty if your code does what you want it to do.
Write (more) unit tests! This is quite possibly the single biggest improvement you can make to your coding.
Don’t check in broken code: the code in git should always pass all tests. If you need to check in work-in-progress, learn to branch. It’s also possible to have git run all your unit tests automatically before you commit or push (via hooks) – if you need this discipline, learn how to set it up.
What’s a unit test?
It’s common for people using languages like Python and Matlab purely for numerical analysis to be unfamiliar with unit testing, which is something more commonly used in general purpose software development. Someone asked me for an example, so: imagine you have a function in a module “osd” and that function inverts a full rank square matrix. You might write code that tests your function like so:
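(A minimal sketch: the module name “osd” is from the example above, while the function name `invert` and its in-line implementation here are stand-ins for illustration – in the real repo you would write `from osd import invert`.)

```python
# Sketch of a unit test for a matrix-inversion function.
# In the real repo: from osd import invert
import numpy as np

def invert(matrix):
    # Stand-in for the function under test: invert a full-rank square matrix.
    return np.linalg.inv(matrix)

def test_invert_identity():
    identity = np.eye(3)
    # The inverse of the identity matrix must be the identity itself.
    np.testing.assert_allclose(invert(identity), identity)
```

Run with `pytest` (or adapt to `unittest`): each test calls the function with a known input and asserts what must hold afterwards.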
We call the function under test with an example input, then make assertions about what must be true after calling it: in this case, that the inverse of the identity matrix is also the identity. By building a collection of these tests covering different cases – a range of extremes, erroneous and tricky inputs – we increase our confidence in the code.
In particular, if you find a bug in your code – for example, when running an experiment – you should immediately write a “regression test”, which is just a test that exposes the bug and therefore fails. Then when you fix the code you can rerun the test to check it passes. Now, in the future, if a code change revives the bug somehow, you know you have a test that will expose it – building protection against repeating the same mistakes. Which I do a lot.
Which language should you use for experimentation?
There are two clear answers to this:

a) It depends on the particular circumstance: choose the right tool for the job.

b) Python.

I say this partly in jest, but Python is an excellent choice for data analysis, scripting, experimentation, machine learning, and a very large number of other applications. Python is not going anywhere after two decades of use, and I am confident enough to say it will still be in use in another few decades, so learning it is time well spent if you do not already know it. Python is a good, er, pony to bet on.
Whatever language you use, you’re likely to employ a lot of libraries to run experiments and analyse data. In Python pandas and numpy are good examples. Anyone running your software is going to have to install those packages on their machines, so do them a favour and add a list of dependencies and their versions in your repository. For Python a simple “pip freeze” command will give you a list.
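For a Python project that dependency list is a `requirements.txt`; the packages and version numbers below are purely illustrative:

```
numpy==1.26.4
pandas==2.2.2
matplotlib==3.8.4
```

Collaborators can then recreate the environment with `pip install -r requirements.txt`.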
In your experimental directory, which you’ve cloned and checked out from GitHub (of course), place a “data” directory containing all the input and output data for your experiments: anything too short-lived, too large, or too likely to change to store in git. Use .gitignore to ensure it isn’t included in the repo.
The “data” directory should be synced to a filestore that you can access from anywhere, for example Dropbox, Google Drive, or OneDrive; if you’re a Linux user, you can use rsync with a service such as rsync.net. There are obviously legal restrictions on where some data can be stored, but I’m assuming here that we’re dealing with numerical simulations of equations, not personal data.
All good scientists know that:
Independent Variables -> Experiment -> Dependent Variables
I mirror this in the structure of my scripts:
Inputs -> Code -> Outputs (& auxiliary data)
Where “auxiliary data” is things like timestamps, hostname of the computer, execution time, etc. that may be useful in planning future experiments or in tracking down problems.
The goal here is to make visible the things that change in your experiments. So your “Inputs” live in files, not in the scripts themselves. Then you can eyeball those input files: put your parameters in a file you can view in a spreadsheet before an experiment, and know that only those things change from experiment to experiment (include your random seeds in those files). Humans are great at spotting patterns in visual data, so take an approach that lets you look at your data in nice tools such as spreadsheets as much as possible. You’ll spot mistakes before you waste a lot of time running experiments you didn’t mean to.
Actually, I’ll go further:
Inputs -> Code -> Outputs (& inputs, & auxiliary data)
That is, the output files from an experiment include its inputs: this guards against bugs in your experimental scripts, and allows you to validate that you have indeed run the experiments you think you have.
One, Format, To, Rule, Them, All
For experimental inputs and outputs, comma-separated (CSV) files are awesome. You can view them in Excel, you can easily split them and join them together, they’re intuitive to look at. If you can set it up so one line in the CSV is one experiment, they are very easy to work with.
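For instance, a hypothetical inputs file, one experiment per line (the column names and values are invented for illustration):

```
experiment_id,alpha,n_samples,seed
1,0.1,1000,42
2,0.1,1000,43
3,0.5,1000,42
```

Note that the seeds are columns too, so everything that varies between experiments is visible in one place.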
Alternatives include JSON and XML.
(Example output: experimental results for numerically calculated expected errors – the epsilon column. Note that most of the columns repeat the inputs, the independent parameters of the experiment.)
- Keep a git repo with a clean layout that contains everything except input and output data, which should be stored on a separate file store.
- Write code that works, then increase your confidence that it’s correct with unit tests, then make it fast by profiling and optimising.
- Structure your experiments so you can describe them with a CSV file or similar and output results that also contain the independent variables.
…and break the rules when necessary:
“A foolish consistency is the hobgoblin of little minds” – Ralph Waldo Emerson (and PEP 8!)