Semantic Werks

Thoughts on people, machines and systems.

A better scientific notebook

Cameron Neylon is an advocate of open science (along with others like Michael Nielsen). Among other things, open science, or Science 2.0, means keeping track of your mundane day-to-day activities (I like to call it “sciencing”) in some publicly accessible repository. This way other researchers (your competition?) can see what you are doing and comment on, question, or improve your work. This also has benefits for the researcher herself, of course, in keeping track of which steps were followed. Nothing is worse than getting a good result and not being able to replicate it (“I swear it was there! I saw it!” “Yes, of course you did. Here, just take this pill...”). Actually, there is something worse: losing data and not having a backup. Like me.

Recently I was working on a project – MSR/data mining of open source software – and saw one aspect of this that could be improved. I’m working with R, the open-source statistics toolkit, to do data analysis. I’ve never used it before, so there’s a fair bit of reading the manual, copying and pasting examples, and experimentation involved. This can be dangerous: am I selecting a result from my analysis just because it “looks right”, or because I really understand what the analysis is saying? I will also want to repeat the steps that produced whatever graph I end up with (e.g., changing my numeric yearweek dates to R date objects).

Helpfully, R keeps a command line history that you can save along with a ‘workspace’. But one thing I would like to be able to do with this, and with any command line environment, is to flag the key commands. For instance, say I modify an online example to plot a regression line for my data. I would like to send that command to a “repeatable steps” repository, to preserve a sense of useful workflow (not just history). The vanilla history typically contains a lot of missteps, and going back through it can be a challenge. To make this tool even better, it should parse out the dataset-specific pathnames and variable names and replace them with placeholders. Then I would be left with something like:

> <data.frame.variable> <- read.csv(file="<csv file name>", header=TRUE, sep=",")
> summary(<data.frame.variable>)
> <data.frame.variable>$<index column> <- seq_len(nrow(<data.frame.variable>))
> # %U alone does not pin down a day, so append a weekday (1 = Monday) before parsing
> <dates.variable> <- strptime(paste0(<data.frame.variable>$<dateweek.variable>, "1"), "%Y%U%u")
> plot(<dates.variable>, <data.frame.variable>$<values.variable>)

This would allow me to insert my own variable definitions (a meta-language, I guess) and have the result run automatically as a script in R.
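As a rough illustration of that meta-language idea, here is a minimal sketch in R itself. It is only one possible approach, not an existing tool: the `fill_template` helper, the placeholder names (`df`, `csv`, `index`, `dates`, `yearweek`) and the demo data are all made up for the example. It stores the steps as strings with `<placeholder>` slots, fills the slots in for one dataset, and runs the result.

```r
# Hypothetical helper: replace each <name> slot in the template with a value.
fill_template <- function(template, values) {
  for (name in names(values)) {
    slot <- paste0("<", name, ">")
    template <- gsub(slot, values[[name]], template, fixed = TRUE)
  }
  template
}

# The "repeatable steps", stored once with dataset-specific names stripped out.
steps <- c(
  '<df> <- read.csv(file = "<csv>", header = TRUE, sep = ",")',
  '<df>$<index> <- seq_len(nrow(<df>))',
  # A week number alone does not pin down a day, so append weekday 1 (Monday).
  '<dates> <- as.Date(paste0(<df>$<yearweek>, "1"), format = "%Y%U%u")'
)

# Made-up demo data, written to a temporary file so the sketch runs anywhere.
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(yearweek = c(201001, 201002, 201003),
                     commits  = c(5, 8, 13)),
          csv_path, row.names = FALSE)
csv_path <- normalizePath(csv_path, winslash = "/")

# Fill in this dataset's names and run the result as ordinary R code.
script <- fill_template(steps, list(
  df = "msr", csv = csv_path, index = "idx",
  dates = "weeks", yearweek = "yearweek"
))
env <- new.env()
eval(parse(text = script), envir = env)
print(env$weeks)
```

A plotting step could be added the same way, for instance a template line such as `'plot(<dates>, <df>$<values>)'` with `values = "commits"`. Evaluating inside a fresh environment keeps the generated variables out of the main workspace. The hard part of the original wish, automatically picking the key commands out of the history and stripping the names, is not addressed here.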

Written by Neil

2010 January 11 at 12:00

3 Responses

  1. I recommend a tool that can keep tagged and timestamped notes. Sometimes these lab notes help quite a bit. I also recommend using version control for everything you’re doing and carefully documenting each commit. This commit documentation is your research log :)

    The tools are there; all that is needed is the willpower on your part.

    Anonymous

    2010 January 11 at 17:24

    • Yes, those ideas are good – I use TiddlyWiki and Git for my work. However, re-reading these notes or recreating workflows from commits done months in the past is not quite what I was getting at. What you really need to do is document why certain constants are used, why you are excluding things below a certain threshold, and so on.

      Interestingly I think these small but sometimes very important decisions are very hard to pick up in peer review.

      Neil

      2010 January 11 at 20:03

  2. I think it’s a really interesting question what the threshold is for different purposes. I mean that there is no reason not to record everything, because it is “easy” and comprehensive. But when you’re presenting that to some specific person or system for a specific purpose, you will want to summarize it in some way. The choices you make seem to me to depend on what the purpose of your communication is and what/who the target is.

    A trivial example: if you want to show a software developer a problem with their system, you want a different kind of summary than the cleaned-up and streamlined version that you might submit with a paper. But there are a lot of subtleties here. What do you think about the ideas I suggested about capturing the relationships between the objects you created? Does that work in your context or is there too much command line work between the creation of the relevant objects?

    Cameron Neylon

    2010 January 16 at 06:17

