2012-02-22

Open Computer Programs

I've been reading a new software engineering paper from Les "Safer C" Hatton, Darrel "Torpedo" Ince and John Graham-Cumming: "The Case for Open Computer Programs", in Nature. Fittingly, the full text is online and it's open to public comment.

The position I'd take is that you can't make a refutable argument based on data if your opponents can't themselves analyse and run your programs to:

  1. replicate your claims;
  2. experiment with the effect of varying your starting assumptions; and
  3. debug so that they can identify errors in your code or inconsistencies in your data.

If you don't want to make a refutable argument, you're not actually doing science; it's just propaganda.

The examples quoted include jgc's analysis of the UK Met Office + UEA's Climate Research Unit code for processing global temperature data, and a study performed on seismic data processing algorithms: in both cases, the vast array of data and complex processing meant that errors were simply not visible to the original authors, and it took significant effort from other parties to highlight the problems.

I particularly liked:

"One proposed solution to the problem of ambiguity is to devote a large amount of attention to the description of a computer program, perhaps expressing it mathematically or in natural language augmented by mathematics. But this expectation would require researchers to acquire skills that are only peripheral to their work (set theory, predicate calculus and proof methods)."

Or, expressed more succinctly:

"Make it possible for the programmers to write in English, and you will find that programmers cannot write in English."

As the authors note, even releasing the full source may not make replication of results easy: for instance, programs written in compiled languages (C, C++, even Ada) are notorious for producing different results on different machines in various edge cases. Building a complex program from source is far from trivial, even assuming that you have a reasonable Makefile and all the required tools and dynamic library versions. But it's a start, and at least documenting the exact build machine OS/architecture, build process and versions used gives your opponents something to get their teeth into. And if the source is visible, you don't need to wonder how the program resolved a certain case: you can step through the code in question and work it out with pencil and paper.
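As a rough illustration of the "document the build machine" point, here is a minimal sketch, assuming the analysis is driven from Python. The function name build_manifest and the tool_versions field are my own invention for this example, not something taken from the paper, and the version numbers are placeholders. It only records what the standard library can see; compiler and library versions have to be supplied by whoever runs the build.

```python
import json
import platform
import sys


def build_manifest(extra_versions=None):
    """Collect basic facts about the machine and interpreter running an analysis."""
    manifest = {
        "os": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
        "byte_order": sys.byteorder,
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        # Machine epsilon for floats: a hint at the floating-point format in use.
        "float_epsilon": sys.float_info.epsilon,
    }
    # Hypothetical field: the caller supplies versions of compilers or
    # libraries used in the build, since these cannot be discovered generically.
    manifest["tool_versions"] = dict(extra_versions or {})
    return manifest


if __name__ == "__main__":
    # Placeholder tool name and version, for illustration only.
    print(json.dumps(build_manifest({"example-compiler": "1.2.3"}), indent=2, sort_keys=True))
```

Publishing a small file like this alongside the source and data won't make the results reproducible by itself, but it at least tells a reader which environment they are failing to match.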

[Hat tip: jgc himself]
