Semantic Werks

Thoughts on people, machines and systems.

Posts Tagged ‘msr’

A better scientific notebook

with 3 comments

Cameron Neylon is an advocate of open science (along with others like Michael Nielsen). Among other things, open science or Science 2.0 means keeping track of your mundane day-to-day activities (I like to call it “sciencing”) using some publicly accessible repository. This way other researchers (your competition?) can see what you are doing and comment/question/improve your work.  This also has benefits for the researcher herself, of course, in keeping track of what steps were followed. Nothing is worse than getting a good result and not being able to replicate it (“I swear it was there! I saw it!” “Yes of course you did, here, just take this pill..”). Actually there is something worse and that is losing data and not having a backup. Like me.

Recently I was working on a project (MSR, the data mining of open source software repositories) and saw one aspect of this that could be improved. I’m working with R, the open-source statistics toolkit, to do data analysis. I’ve never used it before, so there’s a fair bit of reading the manual, copying/pasting of examples, and experimentation involved. This can be dangerous: am I selecting a result from my analysis just because it “looks right”, or because I really understand what the analysis is saying? I am also going to want to repeat the steps that produced the particular graph I end up with (e.g., changing my numeric yearweek dates to R date objects).

Helpfully, R has a command line history you can store in a ‘workspace’. But one thing I would like to be able to do with this, and with any command line environment, is to somehow identify the key commands. For instance, I modify an online example to plot a regression line for my data. I would like to send that command to a “repeatable steps” repository to preserve a sense of useful workflow (not just history). The vanilla history typically has a lot of missteps, and going back through it can be a challenge. To make this tool even better, it should parse out the dataset-specific pathnames and variable names and replace them with placeholders. Then I would be left with something like:

> <data.frame.variable> <- read.csv(file="<csv file name>", header=TRUE, sep=",")
> summary(<data.frame.variable>)
> <data.frame.variable>$<index column> <- seq_len(nrow(<data.frame.variable>))
> <dates.variable> <- strptime(as.integer(<data.frame.variable>$<dateweek.variable>), "%Y%U")
> plot(<dates.variable>, <data.frame.variable>$<values.variable>)

This would allow me to insert my own variable definitions (a meta-language, I guess) and have the result run automatically as a script in R.
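To make the idea concrete, here is a minimal sketch of the “fill in the blanks” half of such a tool, written in Python rather than R. Nothing like this exists in my workflow yet: the function name `fill_template`, the placeholder syntax it accepts, and the example values `bugs` and `bugs.csv` are all assumptions made up for illustration.

```python
import re

# Placeholders look like <data.frame.variable> or <csv file name>. Requiring a
# leading letter keeps R's assignment arrow "<-" from being mistaken for one.
PLACEHOLDER = re.compile(r"<([A-Za-z][A-Za-z0-9._ ]*)>")


def fill_template(template, bindings):
    """Replace each <placeholder> in template with its value from bindings."""
    def substitute(match):
        name = match.group(1)
        if name not in bindings:
            raise KeyError(f"no value supplied for placeholder <{name}>")
        return bindings[name]

    return PLACEHOLDER.sub(substitute, template)


# A two-line template in the spirit of the R steps above.
template = (
    '<data.frame.variable> <- read.csv(file="<csv file name>", header=TRUE, sep=",")\n'
    "summary(<data.frame.variable>)\n"
)

script = fill_template(template, {
    "data.frame.variable": "bugs",   # hypothetical R variable name
    "csv file name": "bugs.csv",     # hypothetical input file
})
print(script)
```

The filled-in text could then be saved to a file and handed to R as an ordinary script. The harder half, working out which parts of a command history are dataset-specific, is not attempted here.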

Written by Neil

2010 January 11 at 12:00

Posted in Uncategorized

MSR: final thoughts

Source and data

Here’s my source code and extrapolated data. It’s hosted at GitHub. I have the raw data, but it’s in the gigabyte range. Email me if you’d like it. Rather than the painful exercise of parsing giga-lines of XML, I’d recommend checking out a project like Bicho, which can extract Bugzilla data from numerous repositories (wish it’d been available for me!).

Where to go from here

I started out with the rough conjecture, based on Lehman’s Seventh Law, that as a software project matures (please say that “Maw-Toors”, not “match-yoors”, kthx) the percentage of time devoted to discussing software quality ought to increase. That is, less time is spent on ‘feature A isn’t working’ and more on ‘let’s make the GUI more swell’. The data doesn’t support this conjecture.

One problem is that the taxonomy I was using to seed my signifiers, ISO 9126, is very high level, and covers a range of qualities. For example, it includes the notion of ‘correctness’, as in, the program will behave as expected. Well, clearly such issues would be a concern from day 1. I didn’t find any useful correlation between project age and discussions of software quality. R² values were very low, indicating, in my linear model anyway, that there was no linear relationship. I also tried a cyclical model based around release windows, but this also had low correlation. Even so, my final conclusion was this: there is definitely some signal to be extracted from these datasets (mailing lists, bug reports, commit messages). That is, these qualities *are* being discussed by software project participants (I haven’t determined _who_ is discussing them yet).
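For readers who haven’t met R² before, here is a small Python sketch of how it falls out of a straight-line fit. This is not the code I ran (that was done in R), and the numbers are toy values invented for the example, not my data. The function name `r_squared` is my own.

```python
def r_squared(xs, ys):
    """R² of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or ss_tot == 0:
        raise ValueError("R² is undefined when x or y is constant")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares: what the line fails to explain.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_res / ss_tot


# Toy data only: project age in years against some made-up measure.
age = [1, 2, 3, 4, 5]
print(r_squared(age, [2, 4, 6, 8, 10]))  # perfectly linear: 1.0
print(r_squared(age, [3, 1, 2, 1, 3]))   # no linear trend at all: 0.0
```

Note that the second series clearly has structure (it dips and recovers) yet scores zero, which is why a low R² only rules out a linear relationship, not every relationship.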

I think this is an interesting result. Certainly, it suggests that software quality discussions are not stagnant, that there is some type of response. The question is, to what? The last experiment I conducted looked at specific peaks in the curves to determine what external events (discovered from mailing lists, source code, etc.) might have caused a spike. Of course, here we tread into the uncertain but possibly richly rewarding area of qualitative research. What I mean is, there is no way to prove, given a particular event, that it caused the spike — just the chance for a high degree of certainty. But really, that’s all an experiment gives anyway.

A colleague, Jorge Aranda, made the comment that experiments in software engineering aren’t that useful, giving one a false sense of learning something but making little contribution to an overarching theory. I certainly agree that this can be the case, but I also think one of the best ways of using experiments is via contradiction of established/cherished opinions. Within its particular context, this experiment has shown, for example, that there isn’t a linear relationship between software age and discussions of software quality. It is a useful result because it helps me to refine a theory for how software is evolving over time. It can help me design good case studies, for example (what not to look for; what explains the result).

I guess I don’t think experimentation is only useful for limited tests of a richly developed theory (not that Jorge is saying this). It can also be useful in an exploratory sense. The real question for my result is this: does the fact that there was no pattern seen overwhelm the generalizability problems of selecting eight projects from one Linux-based desktop suite? That is, is it possible that these patterns exist in other sets of data? It could always be the case, but I believe not.

Meir Lehman believes that software needs to be constantly improved if it isn’t to suffer bit-rot and fail to meet user needs. I think a useful extension of this work is to break software into finer-grained components. One might examine the evolution of particular features, such as journaled file systems; particular aspects, like user accounts; particular qualities, like usability; and even more fine-grained details. I suspect that each of these components evolves more or less independently of the others. I plan to do a re-examination of the Linux study of Mike Godfrey and Qiang Tu with that in mind.

Final thoughts

I think to do research, one has to assume some element of risk at the outset of the project. There has to be a reasonable chance that your efforts will have been wasted, that the data won’t show what you expected. I think this is what often bothers me about papers that propose frameworks, or taxonomies, or methodologies. There’s no risk involved in a new framework. How can it go wrong? All I can think of is that the proposer doesn’t refer to prior work or is logically inconsistent. And I don’t mean theory here; new theories are welcome. It’s just I don’t think there are many legitimate, explain/predict theories in (software) research (see Gregor, 2006 for a description I like of the nature of theory in information systems research).

I guess research doesn’t need an empirical basis, but I think it helps. Empirical evidence is like an acid test for a hypothesis or theory — it gives reviewers a way of evaluating the work independent of what the paper claims. Frameworks, on the other hand, are pretty circular, and for me at least, hard to review. Most framework papers use trivial examples as a way of establishing that they ‘work’, but I reject this. This is argument from experience, and it’s pretty meaningless. Really, all I have to do as a reviewer is show one example your framework claims to handle, but cannot, and the paper is bunkum.

At any rate, this project was one of the first in which I wrote a fair bit of code, and took a bit of a risk with my mental model of what the data would show. I highly recommend the process.

Written by Neil

2009 April 28 at 21:56

Posted in Uncategorized

MSR: Case study variability

This is part of the MSR Challenge series.

To keep the scope of my proposed project in check, I’ve decided to go with a subset of Gnome projects to test my NFR tool on. However, I want to make sure they’re a good cross-section of projects so that I can form some theories about whatever I observe. For example, I should have a mix of project lifespans, and a mix of project sizes (LOC, number of developers, etc.), among other variables. Although the project is restricted to the Gnome forge (a forge being a community of developers, like Sourceforge, MSDN, KDE, etc.), I would ultimately like to take it to other forges to see whether there is a difference there as well.

I selected eight projects, mainly based on my experiences with them, and my first order of selection was software purpose. One of the big absences in the software engineering literature is some taxonomy of software ecologies. I wanted projects that touched on a number of different ecologies, including operating systems, media, utilities, internet, and so on.

In the end, I mainly used the longest-lived projects. They provided the richness needed for a proper longitudinal study. I didn’t see much variation between the two (Nautilus and Evolution). The main difference was that software quality was discussed three times as often in Nautilus as in Evolution, relative to event volume. More work is needed to tease out the causes for this result.

Written by Neil

2009 March 18 at 21:40

Posted in Uncategorized

MSR: external events

This is part of the MSR Challenge series.

Having generated the lion’s share of the data I want to mine, I’ve turned my attention to getting interesting things out. Following a presentation to my research group, I got a few great tips on what they would like to see. One thing that came up was the notion of tying the data points (e.g., 2 mentions of ‘usability’ in 2004-March) to some external events. The idea is that there is a reason behind an increase in mentions/occurrences of the word/concept, beyond my (naïve) initial idea of “heightened interest” in a quality as the project goes on.
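As a rough illustration of what one of those data points is, here is a Python sketch that buckets dated events by month and reports the share of events mentioning a term. It is a simplification rather than my actual tool: it assumes each event is a `(date, text)` pair, counts an event once however many times the term appears, and uses plain substring matching. The function name `monthly_mention_rate` is made up for the example.

```python
from collections import Counter


def monthly_mention_rate(events, term):
    """Map (year, month) to the fraction of that month's events mentioning term.

    events is an iterable of (datetime.date, text) pairs.
    """
    mentions = Counter()
    volume = Counter()
    needle = term.lower()
    for when, text in events:
        key = (when.year, when.month)
        volume[key] += 1
        if needle in text.lower():
            mentions[key] += 1
    # Dividing by the month's event volume keeps a busy month from looking
    # like a spike in interest when it is only a spike in traffic.
    return {key: mentions[key] / volume[key] for key in volume}
```

Plotting these monthly rates over time gives the curve whose peaks then need explaining.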

So the question becomes, how to extract external events that might be causing these spikes? We can readily imagine many such events: change in developer team, release cycle, source audits, and even more random things. Clearly, it would be impossible to determine all of these. But like a good historian, we must try to account for all the variability using some model of the system: like a biological organism, my contention is that the software reacts to the external events by modifying its phenotypic characteristics.

My first task is to determine what plausible external sources might be and account for these. After that, I am going to audit the frequency plots by looking at how much of the variability is accounted for by these ‘plausible’ sources. Anything remaining is an outlier that I will endeavour to explain (for a subset of the data).

The most obvious external forcing is release date. Particularly in Gnome, which has for a while [since when?] followed a coördinated release cycle, these fixed release dates are sure to prompt people to begin questioning software quality.

I tried to extract this information, at first, from Subversion tags in the repository itself. Typically a release changeset gets tagged with the release id for future reference. However, the dataset is quite muddy (imports, inconsistencies, etc). I turned to the #gnome IRC channel, and was pointed to the gnome-announce mailing list. There, each release is announced, even relatively minor ones. I should probably determine when this policy was not followed, but it seems reasonably safe to say that it is the rule, rather than the exception (and thus acceptable error). I wrote a little bash script to download (yay wget!) the mail archives (10 years’ worth), unzip them, and turn them into Apple Mail folders. Now I can search by subject title for the project (‘evolution’) I want. I think I may also add these to my corpora (although they might work nicely as a control set).
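The subject search could also be scripted rather than done by hand in a mail client. Here is a sketch using Python’s standard `mailbox` module. It assumes the unzipped archives are in mbox format, which I haven’t confirmed for every file, and the function name `announcements_for` is my own invention.

```python
import mailbox


def announcements_for(mbox_path, project):
    """Yield (date header, subject) for messages whose subject mentions project."""
    # create=False: fail loudly on a wrong path instead of creating an empty file.
    archive = mailbox.mbox(mbox_path, create=False)
    try:
        for message in archive:
            subject = str(message["subject"] or "")
            if project.lower() in subject.lower():
                yield message["date"], subject
    finally:
        archive.close()
```

Calling `list(announcements_for(path, "evolution"))` on one month’s archive would give the candidate release dates for that project. A name like ‘evolution’ can also turn up in unrelated subjects, so the results still want a quick read-through.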

Written by Neil

2009 February 23 at 12:40

Posted in Uncategorized

MSR Challenge: large files revisited

with 2 comments

This is part of the MSR Challenge series.

Ignore my earlier advice with respect to handling large XML files. My new favorite tool is Vim. Although it takes a few minutes to do it, Vim can easily go to the correct line to fix problems. So if your XML is non-validating, just figure out the line number (should be printed in the SAXException), then open Vim with vim +<line_num> <file_name>. It thinks for a while, but opens to the correct line without trouble. Then you can delete the offending characters (highlighted for me as ^S) and save the file (wait a few more minutes).
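If there are too many bad characters to fix by hand, a streaming pass is another option. This Python sketch is an alternative to the Vim approach, not what I actually did. It reads the file one line at a time, so memory use stays small, and treats as illegal every byte below 0x20 other than tab, line feed and carriage return, which are the only control characters XML 1.0 permits. Both function names are made up for the example.

```python
import re

# Bytes below 0x20 that XML 1.0 forbids (everything except \t, \n and \r).
# The ^S character mentioned above is 0x13.
ILLEGAL = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def find_illegal_control_chars(path):
    """Yield (line_number, offset_in_line, byte_value) for each offending byte."""
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            for match in ILLEGAL.finditer(line):
                yield line_number, match.start(), match.group()[0]


def strip_illegal_control_chars(source_path, cleaned_path):
    """Copy source to cleaned, dropping forbidden bytes; return how many were dropped."""
    removed = 0
    with open(source_path, "rb") as source, open(cleaned_path, "wb") as cleaned:
        for line in source:
            fixed, count = ILLEGAL.subn(b"", line)
            cleaned.write(fixed)
            removed += count
    return removed
```

Writing to a separate cleaned file leaves the original untouched, which seems prudent with multi-gigabyte data that took a long time to collect.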

My first problem was the non-valid characters (control characters); the second problem was a very lengthy string that MySQL won’t accept (exceeds the buffer size). I’m storing the data in MySQL for scale. At first, I tried increasing the buffer size for MySQL, which didn’t work. I’m not sure if this was because I didn’t set the correct variable (I’m using the MySQL administrator panel for OS X), but now I’m thinking it’s just a ridiculous amount of text: if you check out the bug report you can see the person with the problem posted 100,000 lines of empty stack trace. Thanks buddy! I’m doing a horrible string concatenation thing to eliminate newlines, which is horribly inefficient, but seems to work on small-scale bug reports.

This type of MySQL error won’t print the line number, but fortunately my parser prints the bug ids as it goes, so I could look for that id and retrieve the line number that way. My tool of choice was sed, and I just looked for:

sed -n '/[ ]360318/{=;p;}'  <file_name>, where 360318 is the bug id. Here’s where I got the syntax. The -n option suppresses sed’s default output; then I look for a space followed by the bug id as the regular expression; finally the {=;p;} portion prints the line number and then the matching line. I used a space in the regex because you would be surprised how often that six-digit number occurs in a 3 gig file.
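For anyone without sed to hand, the same search is a few lines of Python. This is a sketch of an equivalent, not a replacement I have benchmarked against sed on the 3 gig file. The function name `find_bug_id_lines` is made up.

```python
def find_bug_id_lines(path, bug_id):
    """Yield (line_number, line) for lines containing a space followed by bug_id."""
    # Like the sed pattern, this only insists on the leading space, so a longer
    # number that merely begins with the bug id would also match.
    needle = b" " + str(bug_id).encode("ascii")
    with open(path, "rb") as handle:
        for line_number, line in enumerate(handle, start=1):
            if needle in line:
                yield line_number, line.rstrip(b"\r\n")
```

Reading in binary mode sidesteps decoding errors from the stray control characters, and iterating line by line means the whole file never has to fit in memory.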

I think I could have done this with a complicated sed program: you have to store the bug_id, and only operate on the content in that bug, and since sed is not aware of XML elements, it would involve searching for other regexes as needed. I didn’t feel like becoming a sed hacker, particularly. And yes, I suppose Emacs can manage this as well, but I haven’t tried it. I actually considered ed, since it’s line-based, but I couldn’t figure out how to get it to go to the line I wanted. I’m thinking one of my programming maxims will be, “If you are considering a solution involving ed, think of another solution.”

Written by Neil

2009 February 5 at 12:08

Posted in Uncategorized
