MSR: non-functional requirements over time
This is part of the MSR Challenge series.
The challenge I’m undertaking is to report on something ‘interesting’ (and novel) about projects under the Gnome umbrella. Since my thesis is on requirements evolution, I’ve tried to use the Challenge to help me understand the nature of the evolution problem. I suppose the first issue is to convince one that there is a problem. Are requirements changing over time? What factors are driving that change?
Rather than looking at functional requirements, which can vary tremendously across projects (I have my opinions about this but more later), I’ve chosen to look instead at software qualities, aka ‘ilities’ aka NFRs. These are notions such as usability, security, and maintainability. These concepts can be said, tentatively, anyway, to transcend software ecosystems. That is, every piece of software can be judged according to its fulfilment of these ‘soft goals’. In particular, I’ve decided to use the ISO 9126 standard, which lists a brief taxonomy of software qualities. The advantage is that it is relatively well-known, and ‘standard’, for what that’s worth. The problem with using it is that there are as many taxonomies of software qualities as corrupt bankers on Wall Street. Some might argue that even non-functional qualities are not universally applicable.
How am I going to apply this model to the corpus? I’m doing this somewhat iteratively, in that I’m going to start simple and build from there. So for now, my goal is to show how these qualities are discussed over time in a given project. My guiding framework is to model word use (and related terms) as events in the timeline of a project. So if a developer or user mentions ‘usability’ on a mailing list, that becomes an event in my system. Then I record these events and use a timeline to show how they occur over time. Not a terribly sophisticated NLP technique, but let’s see where it gets us before delving into Markov models, naive Bayesian algorithms, and so on.
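As a rough illustration of the "word use as events" idea, here is a minimal Python sketch. The term lists, the `find_events` and `monthly_counts` names, and the sample messages are all made up for the example; the real study would need a much more careful vocabulary per quality.

```python
import re
from collections import Counter
from datetime import datetime

# Hypothetical term lists; the actual terms per quality are not specified here.
QUALITY_TERMS = {
    "usability": ["usability", "usable"],
    "security": ["security", "secure"],
    "maintainability": ["maintainability", "maintainable"],
}

def find_events(messages):
    """Yield (date, quality) once per message that mentions a quality term."""
    for date, text in messages:
        lowered = text.lower()
        for quality, terms in QUALITY_TERMS.items():
            # Whole-word match, so "insecure" does not count as "secure".
            if any(re.search(r"\b" + re.escape(t) + r"\b", lowered) for t in terms):
                yield date, quality

def monthly_counts(events):
    """Bucket events by (year-month, quality) for plotting on a timeline."""
    return Counter((d.strftime("%Y-%m"), q) for d, q in events)

# Made-up messages, purely for illustration.
messages = [
    (datetime(2006, 1, 5), "The usability of the search box is poor."),
    (datetime(2006, 1, 20), "Is this plugin secure? Usability aside."),
    (datetime(2006, 2, 3), "Nothing relevant here."),
]
counts = monthly_counts(find_events(messages))
```

Counting each quality at most once per message keeps one long rant from dominating a month; counting every occurrence would be an equally defensible choice.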
I have a few hypotheses about the presence of my events, relating to how NFRs are more or less important (as gauged by frequency of mention) at various lifestages of a software system, for numbers of developers, for project focus (media, utility, OS, etc.), and so on. More on these later.
MSR Challenge: introduction
This is part of the MSR Challenge series.
In the midst of my work on the project, I’ve just reported on the details. Here’s the overview.
The MSR workshop is a heavily empirical software engineering conference focused on open-source software repositories. A repository typically contains some or all of the following: source code with revisions; mailing lists; bug trackers; websites; user information; etc. Most of MSR is concerned with the source code, but that’s not (for me) necessarily the most interesting.
In the past 10-15 years, these data sources have grown tremendously, giving us a good opportunity to work with real-world data. Of course, these projects are not representative of the industrial, closed-source world, or rather, we don’t have a good handle on how representative they are.
As an aside, it’s high time someone constructed a software ecosystem guide, because comparing SAP implementations to Gnome music players is meaningless in almost any context I can think of. That’s not to say the open-source code is lower quality; on the contrary, I suspect it may be higher quality than your average corporate website written in VB/ASP.
Every MSR workshop has a challenge component; this year’s challenge is to use projects hosted at Gnome.org as the dataset. There are two challenges: 1) predict growth of Gnome projects (following various theories of software evolution and using different predictive models); 2) report on something ‘interesting’ learned from (all or a subset of) these sources.
I’ve chosen the second challenge, and my next post will go into more detail about the ‘interesting’ thing I hope to uncover. Part of my motivation is to help my own reasoning; the other part is to document the methodology I use for the report.
This series is me blogging my way through the project. The due date is March, so I expect to be reporting at odd intervals until then. And if all goes well, look for me in Vancouver in May!
MSR Challenge: large data files
This is part of the MSR Challenge series.
This week’s challenge was to tackle the Bugzilla logs. The data I have is a complete dump of the Gnome bugzilla database, which is enormous — 3.5 gigs. Of course, I don’t need all of this data, just the data that relates to particular projects (e.g., Deskbar).
I think there are two approaches. One is bottom-up, so to speak, where I focus on the details of a specific bug. For this study, I need to extract the date the bug was filed and the subject and comment for that posting. Bugzilla allows follow-up comments, so I am going to extract those as separate GnomeDataObjects as well. For now, I am discarding metadata such as the bug’s status and whether it is an error or enhancement (other researchers are working there). I’m just focusing on the bug comments/description and the date of each comment. If a comment is a system event, such as “BUG XXX is a duplicate of YYY”, I ignore it.
So I have some hacky code that reads in my sample XML file and parses the relevant nodes to extract that data. Hacky, because I’m too lazy to write a search for the node I want, and I’m just using list positions (e.g., entry.childNodes[4].childNodes[0].data where entry is a bug).
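For comparison, the less lazy version looks nodes up by name instead of by position. This is only a sketch: the element names (`bug`, `long_desc`, `who`, `bug_when`, `thetext`) are assumptions about the dump's layout, and the sample document is invented for the example.

```python
from xml.dom import minidom

# Invented sample; the element names are assumed, not taken from the real dump.
SAMPLE = """<bugzilla>
  <bug>
    <product>anjuta2</product>
    <long_desc>
      <who>xxxx@hotmail.com</who>
      <bug_when>2002-05-28 15:31:00</bug_when>
      <thetext>The browser crashed.</thetext>
    </long_desc>
  </bug>
</bugzilla>"""

def child_text(node, tag):
    """Return the text of the first descendant element named `tag`, or ''."""
    matches = node.getElementsByTagName(tag)
    if not matches or matches[0].firstChild is None:
        return ""
    return matches[0].firstChild.data.strip()

def comments(xml_text):
    """Yield (date, author, text) for every comment of every bug."""
    doc = minidom.parseString(xml_text)
    for bug in doc.getElementsByTagName("bug"):
        for desc in bug.getElementsByTagName("long_desc"):
            yield (
                child_text(desc, "bug_when"),
                child_text(desc, "who"),
                child_text(desc, "thetext"),
            )

rows = list(comments(SAMPLE))
```

Looking up by name survives whitespace-only text nodes and reordered children, both of which silently break positional indexing. It still builds the whole tree in memory, though.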
As I was doing this, I realized that this isn’t going to scale too well on a 3 gigabyte file. Rather than do a time-consuming iteration over all the bugs, throwing out those that aren’t in the appropriate project, I need to do some sort of stream processing … which is where SAX comes in.
My other consideration was whether I should just munge this all into a huge database, but I can’t find a simple enough way to do that online. The third alternative would be to write XSLT to split the file or reject the unwanted elements, but I don’t know XSLT at all. And from all accounts, I’m better off that way.
SAX is like an impoverished event-processor: it calls specific methods (e.g., startElement) when the corresponding event occurs (e.g., it encounters the opening tag of an XML element). This is cool, in that it’s like just-in-time processing, and avoids a huge in-memory data structure; on the other hand, your code (at least MY code) is full of if statements and switches, which is crapulent. But it looks to be working, for now.
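To give a flavour of what the streaming version can look like, here is a small sketch using Python's `xml.sax`. It is not my actual parser: the `ProductFilter` class, the element names (`bug`, `product`, `bug_when`, `thetext`) and the sample document are assumptions made for the example.

```python
import xml.sax

class ProductFilter(xml.sax.ContentHandler):
    """Keep (date, text) comments only for bugs filed against one product."""

    def __init__(self, wanted_product):
        super().__init__()
        self.wanted = wanted_product
        self.comments = []      # comments kept so far
        self._buffer = []       # character data of the current element
        self._product = None    # product of the bug being read
        self._pending = []      # comments of the bug being read
        self._when = None       # date of the comment being read

    def startElement(self, name, attrs):
        self._buffer = []
        if name == "bug":
            self._product = None
            self._pending = []

    def characters(self, content):
        # SAX may deliver one text run in several chunks, so accumulate.
        self._buffer.append(content)

    def endElement(self, name):
        text = "".join(self._buffer).strip()
        if name == "product":
            self._product = text
        elif name == "bug_when":
            self._when = text
        elif name == "thetext":
            # Skip system events such as duplicate notices.
            if not text.startswith("*** This bug has been marked"):
                self._pending.append((self._when, text))
        elif name == "bug" and self._product == self.wanted:
            self.comments.extend(self._pending)
        self._buffer = []

# Invented sample; element names are assumed, not taken from the real dump.
SAMPLE = b"""<bugzilla>
  <bug><product>anjuta2</product>
    <long_desc><bug_when>2002-05-28 15:31:00</bug_when><thetext>The browser crashed.</thetext></long_desc>
    <long_desc><bug_when>2002-07-29 14:03:00</bug_when><thetext>*** This bug has been marked as a duplicate of 59361 ***</thetext></long_desc>
  </bug>
  <bug><product>deskbar-applet</product>
    <long_desc><bug_when>2003-01-01 10:00:00</bug_when><thetext>Unrelated.</thetext></long_desc>
  </bug>
</bugzilla>"""

handler = ProductFilter("anjuta2")
xml.sax.parseString(SAMPLE, handler)
```

For the real dump, `xml.sax.parse(path, handler)` reads from a file instead of a string, so only the comments of the current bug are ever held in memory. The chain of `if`/`elif` branches is exactly the crapulence complained about above.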
Here’s some sample data as an illustration:
2002-05-28 16:29:00
scaffold
xxxx@hotmail.com
2002-05-28 15:31:00
Package: anjuta2
Severity: normal
Version: 0.11.3
Synopsis: The Matrix
Bugzilla-Product: anjuta2
Bugzilla-Component: shell
Description:
I was in the site www.thematrix.com and the browser crached !
Thkz
Debugging Information:
[some stack trace]
xxxx@hmc.edu
2002-07-29 14:03:00
*** This bug has been marked as a duplicate of 59361 **
MSR Challenge: data cleaning
This is part of the MSR Challenge series.
I’ve been spending the last few days writing my parsing code for the Gnome data sets. Friday and Sunday have been dedicated to parsing mailing list data. I got a dump of a sample mailing list, the Deskbar Applet list. It’s in MySQL format, so I loaded it into MySQL without any major trouble. I then imported the Python MySQLdb (case matters!) library, and got to work.
I open a connection to the db with the ‘deskbar’ table, then run a query: "SELECT message_body, original_date, subject FROM messages". I then do cursor.fetchall() to retrieve the rows (1091 rows). Since I used a MySQLdb.cursors.DictCursor, each row in the result set is a dictionary with the column names as the keys. Now it’s a simple matter of passing those things into my GnomeDataObject schema (event, date, RSN) as appropriate. Again, simple, right?
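The shape of that code is roughly as follows. To keep the sketch runnable without a MySQL server, it swaps in the standard-library `sqlite3` module with `sqlite3.Row` playing the role of the DictCursor; with MySQLdb the same access pattern comes from passing `cursorclass=MySQLdb.cursors.DictCursor` to `MySQLdb.connect`. The `Event` tuple is a hypothetical stand-in for my GnomeDataObject, and the inserted row is invented.

```python
import sqlite3
from collections import namedtuple

# Hypothetical stand-in for GnomeDataObject; the field names are assumptions.
Event = namedtuple("Event", ["date", "subject", "body"])

# sqlite3 stands in for MySQL here so that the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become addressable by column name
conn.execute(
    "CREATE TABLE messages (message_body TEXT, original_date TEXT, subject TEXT)"
)
conn.execute(
    "INSERT INTO messages VALUES (?, ?, ?)",
    ("Deskbar usability is great", "2006-01-05 10:00:00", "Re: search"),
)

cursor = conn.execute("SELECT message_body, original_date, subject FROM messages")
events = [
    Event(row["original_date"], row["subject"], row["message_body"])
    for row in cursor.fetchall()
]
conn.close()
```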
Well, no. I usually try all this from the REPL environment in Python, and occasionally I would see this strange result for the message body: '[<email.Message.Message instance at 0xb76e324c>, <email.Message.Message instance at 0xb769a90c>]'. What puzzled me was that this is actually Python, using the email system library. At first I thought the mysql connector was doing type conversion, but sadly, this is really just a string.
So a message body that is supposed to be plain text ends up in the db file as some sort of weird serialization result. From what I can tell so far, that’s because the person who wrote the dump script against the list mbox files made a mistake when parsing some special type of message, threaded or something like that.
From inspection, this problem affects around 25% of the messages in the list. This is definitely a serious problem for validity, although if it is random (and not, for example, just occurring to one mailing list contributor, like one who is really interested in some NFR), then no biggy.
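A quick way to put a number on this, and to check whether the damage clusters on one contributor, is sketched below. The `sender` column is an assumption (my query above does not select one), and the rows are invented for the example.

```python
import re
from collections import Counter

# Matches bodies that are really the printed form of a list of Message objects.
BROKEN = re.compile(r"^\[<email\.Message\.Message instance at 0x[0-9a-f]+>")

def is_broken(body):
    """True if the body looks like a serialized list of email Message objects."""
    return bool(BROKEN.match(body.strip()))

# Invented rows; "sender" is a hypothetical column name.
rows = [
    {"sender": "a@example.org", "message_body": "I like the new search box."},
    {"sender": "b@example.org",
     "message_body": "[<email.Message.Message instance at 0xb76e324c>, "
                     "<email.Message.Message instance at 0xb769a90c>]"},
    {"sender": "a@example.org", "message_body": "Patch attached inline."},
    {"sender": "c@example.org", "message_body": "Thanks, fixed in trunk."},
]

broken = [r for r in rows if is_broken(r["message_body"])]
fraction = len(broken) / len(rows)
by_sender = Counter(r["sender"] for r in broken)
```

If `by_sender` is dominated by one or two addresses, the loss is not random and the validity concern stands.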
My next step will be to try and reproduce this with another mailing list dump file. If it’s still a problem, I’ll need to contact the person who produced the data or else just extract the data from the mbox files myself.
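If it does come to extracting the bodies myself, one plausible cause to rule out first (a guess, not confirmed against the actual dump script) is multipart messages: in Python's email library, `get_payload()` on a multipart message returns a list of sub-messages rather than a string, and printing that list gives exactly this kind of output. A sketch of the difference, written for Python 3 and using a message constructed on the spot:

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def body_text(msg):
    """Join the text/plain parts of a message, whether multipart or not."""
    parts = []
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)  # bytes, transfer-decoded
            charset = part.get_content_charset() or "utf-8"
            parts.append(payload.decode(charset, errors="replace"))
    return "\n".join(parts)

# A made-up message with a plain-text part and an HTML alternative.
msg = MIMEMultipart()
msg.attach(MIMEText("I think usability matters here.", "plain"))
msg.attach(MIMEText("<p>I think usability matters here.</p>", "html"))

naive = str(msg.get_payload())  # the printed form of a list of part objects
clean = body_text(msg)          # the text a reader would actually see
```

The standard-library `mailbox.mbox` class iterates over the messages in an mbox file, so `body_text` could be applied to each one in turn.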