A lot of work has gone into applying statistical tools to software engineering artifacts such as code, tests, and social networks. The premier venue for this is the International Working Conference on Mining Software Repositories (MSR).
As an example, other researchers and I have been trying to automatically extract requirements from project artifacts, including emails, issue trackers, and code commit messages. You can read more about my work here, and it was recently extended by Abram Hindle here. Another interesting tack was taken by Radu Vlas at Georgia State University, using ontologies and part-of-speech tagging.
The trouble is that all of the techniques we’ve used to date give us a precision/recall of at best 60/60, meaning we miss 40% of the actual requirements and only 60% of the results are in fact requirements. This isn’t very satisfactory, because it means you miss out on possibly important requirements, and still have to wade through a lot of noise. We’d like to get this to 100/100, of course, but any gain would be an improvement. The benefits would be immense: requirements traceability would be simplified, allowing us to say whether a requirement was implemented, which requirements interact, which requirements a project currently satisfies, and more.
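To make the 60/60 figure concrete, here is a minimal sketch of how precision and recall are computed from a set of flagged items and a set of true requirements. The identifiers and counts are invented for illustration and are not taken from any of the studies mentioned above.

```python
def precision_recall(predicted, actual):
    """Return (precision, recall) for two collections of item identifiers."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Precision: what fraction of the flagged items are real requirements?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: what fraction of the real requirements did we flag?
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Made-up data: 10 real requirements, 10 items flagged, 6 in common.
actual_requirements = set(range(10))   # ids 0..9
flagged_by_tool = set(range(4, 14))    # ids 4..13

precision, recall = precision_recall(flagged_by_tool, actual_requirements)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these toy sets, four of the ten flagged items are noise and four of the ten real requirements are missed, which is the 60/60 situation described above.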
Unfortunately, I wonder if we haven’t hit on the fundamental limit to the natural language parsing/machine learning toolkits (in this domain). I say this because the most obvious successes of machine learning, such as spell-checking or flu tracking, are due at least as much to the vast amount of data involved as to the novelty of the techniques themselves. The problem is that in software projects, the amount of information is measured in the tens of thousands of data points, when it would need to be millions or tens of millions for these techniques to be really successful.
For example, one of the longest-lived open projects, Mozilla, has only about 800,000 issues in its issue tracker. That is not bad, but we need to remove duplicates and bug reports to find the ‘new features’ or ‘requirements’, which leaves something more like a few thousand. And it seems unlikely that commercial sources would have orders of magnitude more to offer. This is like training a spell-checker using a corpus of ten thousand English sentences: it will perform terribly. Google Flu Trends shows the same effect: on the link above, you can see that for the Canadian provinces/territories of PEI, NWT and Nunavut, there is not enough data (due to small populations) to make any predictions.
Getting more data isn’t straightforward, either. We might aggregate projects together, but each individual project tends to use language very differently, so essentially it is like configuring ten different languages for your spell-checker, each with ten thousand sentences. Impossible!
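One rough way to check how differently two projects use language is to compare their vocabularies. The sketch below uses Jaccard similarity over word sets, which is my choice of measure for illustration, not one taken from the work above. The two "projects" are invented one-line messages, and the helper names are hypothetical.

```python
import re


def vocabulary(texts):
    """Collect the set of lowercase word tokens used across a list of texts."""
    words = set()
    for text in texts:
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words


def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented messages standing in for two projects' issue trackers.
project_a = ["add tabbed browsing to the toolbar", "support bookmark sync"]
project_b = ["add query planner hints", "support index rebuild online"]

overlap = jaccard(vocabulary(project_a), vocabulary(project_b))
print(f"vocabulary overlap: {overlap:.2f}")
```

A low overlap on real data would suggest that pooling the projects adds noise rather than useful training examples, although a real analysis would need far more text and some handling of stop words.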
So, assuming that automating requirements extraction from project data is useful, how might we do this? I think there are two approaches. One, we relax our definition of ‘automation’ to allow for human judgement. This might mean asking developers to define a common project lexicon, or to be more diligent about annotating requirements explicitly (many projects already separate these). However, asking developers to do things other than their core activities (writing code!) is usually doomed to failure unless it is nearly painless.
I think the other approach is to move past the bag-of-words model. In most statistical learners, you throw documents or sentences into a huge corpus. This works great for standard information retrieval examples like the Reuters corpus. But in software projects, it feels like by doing this we are losing a lot of the metadata, like dates or people, that might be relevant. Perhaps if we somehow annotate the training data with this information, and feed that into a learner, we would have more success.
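As a sketch of what annotating the training data might look like, the snippet below builds a feature dictionary that combines plain word counts with two metadata fields. Everything here is an assumption made for illustration: the function names, the choice of author and days-before-release as features, and the sample message are all hypothetical.

```python
import re
from collections import Counter
from datetime import date


def bag_of_words(text):
    """Plain word counts: the usual representation, with no metadata."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def features(message, author, sent_on, release_date):
    """Word counts plus metadata, as one flat feature dictionary."""
    feats = {f"word={w}": c for w, c in bag_of_words(message).items()}
    # Hypothetical metadata features: who wrote it, and how close to a release.
    feats[f"author={author}"] = 1
    feats["days_before_release"] = (release_date - sent_on).days
    return feats


example = features(
    "Users should be able to export reports",
    author="alice",
    sent_on=date(2012, 3, 1),
    release_date=date(2012, 3, 31),
)
print(example["days_before_release"])  # 30
```

Dictionaries like this can be handed to any learner that accepts sparse named features (for instance via scikit-learn's `DictVectorizer`). Whether the extra fields actually help is exactly the open question.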
- I may be ball-parking these numbers!