Posts Tagged ‘science 2.0’
More on open notebooks
I recently posted about what an open notebook in software science might look like. I think I confused a life stream (where life == work) with a notebook. From what I’ve seen looking at projects like OpenWetWare, they seem more like Trac or GitHub than a FriendFeed account. You get a wiki to write on, image handling, etc., but it isn’t automated: you have to enter all the data yourself.
This is incredibly useful, but am I right in thinking it is similar to tools software engineers have known for decades? It seems like the innovations are in collaborative editing, version control, and digital data.
What I was imagining was more automatic: whenever your microarray machine ran an experiment, it would auto-enter the results on your open notebook. Similarly for code you might run for statistical analysis (like the R workspace question I raised earlier).
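A minimal sketch of what that automatic recording might look like for analysis code, assuming a Python setting rather than R. Everything here is hypothetical and for illustration only: `NOTEBOOK` is an in-memory stand-in for a real open notebook, and `recorded` and `mean` are made-up names, not part of any existing tool.

```python
import datetime
import functools

# Hypothetical in-memory stand-in for an open notebook; a real one
# would post each entry to a wiki or other public repository.
NOTEBOOK = []


def recorded(func):
    """Append a dated entry to NOTEBOOK every time `func` is run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        NOTEBOOK.append({
            "when": datetime.datetime.now().isoformat(timespec="seconds"),
            "step": func.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
        })
        return result
    return wrapper


@recorded
def mean(values):
    """An example analysis step (illustrative only)."""
    return sum(values) / len(values)
```

Calling `mean([1, 2, 3])` would return the result as usual and leave one entry in `NOTEBOOK` describing what was run and when. The point is only that the record is a side effect of doing the work, not something typed in afterwards.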
I like the idea of ‘recording’ what you HAVE done (not what you will do, which is more brainstorming, mind-mapping, whiteboarding, etc.). It is a very important part of selfish science, which is to say, self-replication (presumably the sine qua non of scientific reproducibility). Here are a few features I think are useful for personal lab notes:
- A wiki with dates.
- Separate entries.
- Graphviz-Dot conversion.
- Semantic markup.
- Inline photos.
- Inline LaTeX.
I’m not saying these notebooks have no value: clearly they do. But I think there is a lot more that could be done with the concept. Particularly using linked data (oh noes! the semantic web!) to import other researchers’ results.
What we really want is a list of steps: some small ‘unit of science’ that can be repeated. We should show this using process models, so we can model loops and branches, and possibly execute and recompose them. Google Wave is touted as the best thing for this, and I think it’s true. SAP has a version of its business process editor in Wave, and Google itself sees a need for it. Wave’s collaboration feature is useful, but I don’t think it is the real advantage yet. Right now, Wave’s support for version control (well, history) and its ability to incorporate agents/bots and arbitrary JavaScript extensions are more useful. For example, someone has written ‘Watexy’, a Wave bot which can interpret LaTeX equations.
It’s truly an exciting time to be working in science.
Open science and workflows
I was talking to Jon Pipitone about scientific computing. For a long time this field was mired in the relatively obscure (yet vitally important) field of numerical analysis. Now, however, with the attention generated by ‘ClimateGate’ and open-source software, interest in scientific computing has grown, particularly with respect to the repeatability of computational experiments. (By scientific computing I mean computing for scientific disciplines such as biology, chemistry, and physics, and in particular the software supporting that computing.) An excellent introduction is the Microsoft Research report on “4th Paradigm science”.
Spurred on by a post by programmers who have converted relatively opaque C/Fortran code to Python, I wondered what other such projects might be around. The goal is to make the procedures followed more open and understandable to laypeople (as much as that might be possible: just because we know what rain is doesn’t mean we are all climatologists).
I asked him what might be worth trying to convert:
A … particularly nasty, but possible idea would be to convert a single fortran module from an existing climate model over into python, and then use some fancy python-fortran bridge to make the two talk to each other. That way you could slowly convert a model over to python. You’d be forced into, at least partially, keeping the original model architecture. That wouldn’t be ideal, but at least you’d know you were being true to the model (because you could compare output).
Sounds nasty to me. If you were considering rewriting a chunk of a model, I’d suggest starting with NASA’s ModelE (or a newer version). It’s the simplest and littlest, big GCM I’ve seen.
But then I realized that moving code from C/Fortran to Python gains you a little bit of readability, a lot of maintainability, sacrifices speed, and leaves you, ultimately, back at the same point you started: computer code (procedural at that).
There’s a parallel to ‘literate programming’. What we would really like to do is write these tools in a language that is platform independent and language independent.
Here’s how I see the transition:
1. Cognitive understanding -> 2. Language of science (mathematics, with bio/phys/chem extensions) -> 3. Language of platform (R, Mathematica, custom code) -> 4. Bytecode -> 5. Computer processing -> 6. Output representation
We would like to get rid of having to do the second translation, right? So that you can just write in the language of mathematics and have the output (prediction, in the form of graph, chart, numbers) be correct. So I guess there should be two sides to this workflow: one from the natural language to the bytecode, and the other from the bytecode back out to natural representation.
The assumption I’m making is that the further away from bytecode you get the more people have a chance of understanding your work.
Some of this discussion is (uncomfortably) similar to model-driven approaches, of course. The challenge there, for me, has always been that you cannot represent *all* the problem in the model – so you end up with a bunch of custom code anyway. Jon again:
Yup. And the climate scientists will tell you that all the time. There are all sorts of optimisations and workarounds that have to be specified in the code. Not to mention the fact that the way you decide to discretize the mathematics in the papers and which algorithms you choose as implementations are also dependent on the rest of the model/compiler/platform, etc.. So it’s not that we’re trying to replace the second step, but just make it clear what’s happening along the way.
A better scientific notebook
Cameron Neylon is an advocate of open science (along with others like Michael Nielsen). Among other things, open science or Science 2.0 means keeping track of your mundane day-to-day activities (I like to call it “sciencing”) using some publicly accessible repository. This way other researchers (your competition?) can see what you are doing and comment/question/improve your work. This also has benefits for the researcher herself, of course, in keeping track of what steps were followed. Nothing is worse than getting a good result and not being able to replicate it (“I swear it was there! I saw it!” “Yes of course you did, here, just take this pill..”). Actually there is something worse and that is losing data and not having a backup. Like me.
Recently I was working on a project (MSR/data mining of open source software) and saw one aspect of this that could be improved. I’m working with R, the open-source statistics toolkit, to do data analysis. I’ve never used it before, so there’s a fair bit of reading the manual, copying/pasting of examples, and experimentation involved. This can be dangerous: am I selecting a result from my analysis just because it “looks right”, or because I really understand what the analysis is saying? I am also going to want to repeat the steps that produced the particular graph I end up with (e.g., changing my numeric yearweek dates to R date objects).
Helpfully, R keeps a command-line history that you can save along with your ‘workspace’. But one thing I would like to be able to do with this, and any command-line environment, is to somehow identify the key commands. For instance, I modify an online example to plot a regression line for my data. I would like to somehow send that command to a “repeatable steps” repository to preserve a sense of useful workflow (not just history). The vanilla history typically has a lot of missteps, and going back through it can be a challenge. To make this tool even better, it should somehow parse out the dataset-specific pathnames and variable names. Then I would be left with something like:
> <data.frame.variable> <- read.csv(file="<csv file name>", head=TRUE, sep=",")
> summary(<data.frame.variable>)
> <data.frame.variable>$<index column> <- sequence(nrow(<data.frame.variable>))
> <dates.variable> <- strptime(as.integer(<data.frame.variable>$<dateweek.variable>), "%Y%U")
> plot(<dates.variable>, <data.frame.variable>$<values.variable>)
Which would allow me to insert my own variable definitions (a meta-language, I guess) and have that run automatically as a script in R.
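Here is a rough sketch of how the "parse out the names, then fill them back in" idea could work, written in Python as a tool that treats R commands as plain text. It is not an existing R feature. The function names `generalize` and `fill` and the example dataset names are hypothetical, and the sketch assumes placeholder labels are dotted names without spaces.

```python
import re

# A placeholder is <label>, where label is a dotted name with no spaces.
# R's assignment arrow "<-" does not match, because "-" cannot start a label.
PLACEHOLDER = re.compile(r"<([A-Za-z][\w.]*)>")


def generalize(command, names):
    """Turn a concrete R command into a template.

    `names` maps each dataset-specific name to a placeholder label.
    """
    for concrete, label in names.items():
        # Match whole R identifiers only (R names may contain dots).
        pattern = r"(?<![\w.])" + re.escape(concrete) + r"(?![\w.])"
        command = re.sub(pattern, lambda m, l=label: "<" + l + ">", command)
    return command


def fill(template, bindings):
    """Substitute each <label> with its bound value; unbound labels are errors."""
    def lookup(match):
        label = match.group(1)
        if label not in bindings:
            raise KeyError("no value for placeholder <%s>" % label)
        return bindings[label]
    return PLACEHOLDER.sub(lookup, template)


# Example: a command from one session, generalized and reused on another dataset.
step = 'commits <- read.csv(file="commits.csv", head=TRUE, sep=",")'
template = generalize(step, {"commits": "data.frame", "commits.csv": "csv.file"})
script = fill(template, {"data.frame": "bugs", "csv.file": "bugs.csv"})
```

The filled-in `script` is just a string of R code; running it would still be left to R itself (for example by writing it to a file and sourcing it). Plain text substitution like this knows nothing about R syntax, so an expression such as `a<b>c` would be misread as containing a placeholder.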

