Semantic Werks

Thoughts on people, machines and systems.

Open science and workflows


I was talking to Jon Pipitone about scientific computing. For a long time this field was confined to the relatively obscure (yet vitally important) discipline of numerical analysis. Now, however, with the attention generated by ‘ClimateGate’ and open-source software, interest in scientific computing has grown, particularly with respect to the repeatability of computational experiments. (By scientific computing I mean computing for scientific disciplines such as biology, chemistry and physics, and in particular the software supporting that computing.) An excellent introduction is the Microsoft Research report on “4th Paradigm science”.

Spurred on by a post by programmers who have converted relatively opaque C/Fortran code to Python, I wondered what other such projects might be around. The goal is to make the procedures more open and understandable to laypeople (as much as that might be possible: just because we know what rain is doesn’t mean we are all climatologists).

I asked him what might be worth trying to convert:

A … particularly nasty, but possible idea would be to convert a single Fortran module from an existing climate model over into Python, and then use some fancy Python-Fortran bridge to make the two talk to each other. That way you could slowly convert a model over to Python. You’d be forced into, at least partially, keeping the original model architecture. That wouldn’t be ideal, but at least you’d know you were being true to the model (because you could compare output).

Sounds nasty to me.  If you were considering rewriting a chunk of a model, I’d suggest starting with NASA’s ModelE (or a newer version). It’s the simplest and littlest, big GCM I’ve seen.
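
To make the “compare output” idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: `reference_weighted_mean` merely stands in for an original routine that would, in practice, be called through a Fortran bridge (NumPy’s f2py is one such tool; the bridge itself is not shown), and `ported_weighted_mean` plays the role of the rewrite. Neither comes from ModelE or any real model. The only point is the shape of the check: run both on the same inputs and flag any disagreement beyond a tolerance.

```python
import math


def reference_weighted_mean(values, weights):
    """Stand-in for the original routine (hypothetical).

    In a real port this would be the Fortran module reached through a bridge.
    """
    total = 0.0
    weight_sum = 0.0
    for value, weight in zip(values, weights):
        total += value * weight
        weight_sum += weight
    return total / weight_sum


def ported_weighted_mean(values, weights):
    """The Python rewrite under test (hypothetical)."""
    weighted = math.fsum(v * w for v, w in zip(values, weights))
    return weighted / math.fsum(weights)


def find_mismatches(reference, port, cases, rel_tol=1e-12):
    """Return (args, expected, actual) for every case where the port disagrees."""
    mismatches = []
    for args in cases:
        expected = reference(*args)
        actual = port(*args)
        if not math.isclose(expected, actual, rel_tol=rel_tol):
            mismatches.append((args, expected, actual))
    return mismatches


# Made-up inputs, chosen only to exercise the comparison.
cases = [
    ([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]),
    ([0.1, 0.2, 0.3], [3.0, 2.0, 1.0]),
]

mismatches = find_mismatches(reference_weighted_mean, ported_weighted_mean, cases)
print("port agrees with reference" if not mismatches else mismatches)
```

The tolerance matters: two correct implementations that sum in a different order will rarely agree bit for bit, so an exact equality test would reject a faithful port.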

But then I realized that moving code from C/Fortran to Python gains you a little bit of readability and a lot of maintainability, sacrifices speed, and ultimately leaves you back where you started: with computer code (procedural at that).

There’s a parallel to ‘literate programming’. What we would really like to do is write these tools in a form that is independent of both platform and programming language.

Here’s how I see the transition:
Science workflow: 1. Cognitive understanding -> 2. Language of science (mathematics, with bio/phys/chem extensions) -> 3. Language of platform (R, Mathematica, custom code) -> 4. Bytecode -> 5. Computer processing -> 6. Output representation

We would like to get rid of having to do the second translation (from the language of science to the language of the platform), right? Then you could just write in the language of mathematics and have the output (a prediction, in the form of a graph, chart or numbers) be correct. So I guess there should be two sides to this workflow: one from the natural language to the bytecode, and the other from the bytecode back out to a natural representation.

The assumption I’m making is that the further away from bytecode you get the more people have a chance of understanding your work.

Some of this discussion is (uncomfortably) similar to model-driven approaches, of course. The challenge there, for me, has always been that you cannot represent *all* of the problem in the model, so you end up with a bunch of custom code anyway. Jon again:

Yup. And the climate scientists will tell you that all the time. There are all sorts of optimisations and workarounds that have to be specified in the code. Not to mention the fact that the way you decide to discretize the mathematics in the papers and which algorithms you choose as implementations are also dependent on the rest of the model/compiler/platform, etc. So it’s not that we’re trying to replace the second step, but just make it clear what’s happening along the way.
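
A small illustration of the discretization point, as a sketch under my own assumptions rather than anything taken from a climate model: the equation dy/dt = -k*y is a single line of mathematics, but turning it into code forces a choice of scheme. The function names and parameter values below are invented for the example. Forward (explicit) and backward (implicit) Euler both “implement” the same equation and give different numbers.

```python
import math


def forward_euler_decay(y0, k, dt, steps):
    """Explicit scheme: y[n+1] = y[n] + dt * (-k * y[n])."""
    y = y0
    for _ in range(steps):
        y = y + dt * (-k * y)
    return y


def backward_euler_decay(y0, k, dt, steps):
    """Implicit scheme: y[n+1] = y[n] / (1 + k * dt)."""
    y = y0
    for _ in range(steps):
        y = y / (1.0 + k * dt)
    return y


def exact_decay(y0, k, t):
    """Closed-form solution y(t) = y0 * exp(-k * t), for comparison."""
    return y0 * math.exp(-k * t)


# Illustrative values only.
y0, k, dt, steps = 1.0, 1.0, 0.1, 10

explicit = forward_euler_decay(y0, k, dt, steps)
implicit = backward_euler_decay(y0, k, dt, steps)
exact = exact_decay(y0, k, dt * steps)

print(f"forward Euler:  {explicit:.4f}")
print(f"backward Euler: {implicit:.4f}")
print(f"exact:          {exact:.4f}")
```

For these values the explicit scheme lands below the exact answer and the implicit one above it. Make the step large enough (k*dt greater than 2) and the explicit scheme oscillates and grows while the implicit one still decays. None of that is visible in the equation as written in a paper, which is why it helps to make the choice explicit along the way.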

Written by Neil

2010 February 1 at 11:41

2 Responses


  1. I think one of the biggest challenges is to get design choices about parallelization and algorithm optimization up there in the “language of science” representation, so that these are no longer an afterthought.

    Steve Easterbrook

    2010 February 1 at 12:46

    • But then we get into the challenge of educating scientists about parallelization and optimization and discretization. Topics even experienced programmers don’t understand very well.

      Science was easier when we just had slide rules and log tables.

      Neil

      2010 February 2 at 12:06

