Best Practices for Scientific Computing

(Wilson et al., 2014) in PLoS biology

We spend so much time writing code, but software engineering classes are generally garbage. Plus, nobody writes tests for experimental code being run on a cluster. We just lurch from bug to bug until we get a working version.

Today’s paper, recommended by Tom Lippincott, presents a few antidotes for scientific software struggles.

Software is as important to modern scientific research as telescopes and test tubes.… We believe that software is just another kind of experimental apparatus and should be built, checked, and used as carefully as any physical apparatus.

Their advice:

Write Programs for People, Not Computers. Make it pretty, and make it decomposed. Stick to formatting guidelines.

Let the Computer Do the Work. Use a build tool like Make to automate workflows. It can also suss out which pieces depend on which. Make the entire process automatic.

Make Incremental Changes. Both at the code-level and the project-level. Test little pieces rather than building the entire edifice. Use version control.

Don’t Repeat Yourself. That is, copy and paste are the enemies of software development. Instead, make code modular so that you have only one version you need to fix when you’ve erred. If someone else has written something, use theirs. In NLP, use torchtext instead of always writing your own custom data loaders. How many times did I question edge cases in my own Vocab implementations before finding this?

Plan for Mistakes. assert is your best friend. When it’s 2:30 A.M. and you’ve been beating your head against the desk, you might feel certain things are obvious or aren’t. You might take certain things for granted. “That’s totally a probability distribution. I’m definitely sofmaxing on the right axis.” You’ll probably be wrong, because you didn’t sleep at 10 P.M. like a sensible person. Put assert into your code to check your assumptions, basically for free. They’ll kill your program if you were wrong and show you exactly where it happened.

Correct First, Fast Later. Find bottlenecks with a debugger, rather than with your intuition. Write in the highest-level language you can. Python is a champ. C is a chump. Developer hours (your time!) cost more than program hours, so prefer fewer dev hours even if it means your results come a little later. You can already be working on something new while you wait.

Premature optimization is the root of all evil. —Don Knuth

Collaborate. Use all the good features GitHub has to offer: issue tracking, pre-merge code reviews, etc.

The article as a whole is light on details, and it’s likely things you’ve heard before if you’re a grad student in computer science. Still, the gentle reminder about assert every now and then is welcome. It’ll do you some good.

The takeaway:

  • Write code in a way that reduces your cognitive burden and hours spent.
  • Use a build system.
Written on February 16, 2018