The Trouble with Data Mining in Software Engineering

A lot of work has gone into applying statistical tools to software engineering artifacts, like code, tests, and social networks. The premier venue for this is the International Working Conference on Mining Software Repositories (MSR).

As an example, other researchers and I have been trying to automatically extract requirements from project artifacts, including emails, issue trackers, and code commit messages. You can read more about my work here, and it was recently extended by Abram Hindle here. Another interesting tack was taken by Radu Vlas at Georgia State University, using ontologies and part-of-speech tagging.

The trouble is all of the techniques we’ve used to date only give us a precision/recall of at best 60/60 [1], meaning we miss 40% of the actual requirements and only 60% of the results are in fact requirements. This isn’t very satisfactory, because it means you miss out on possibly important requirements, and still have to wade through a lot of noise. We’d like to get this to 100/100, of course, but anything would be an improvement. The benefits would be immense: requirements traceability would be simplified, allowing us to talk about whether a requirement was implemented, what requirements interact, describe the current requirements a project satisfies, and many others.
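To make the 60/60 figure concrete, here is a minimal sketch of how precision and recall are computed. The item names and counts are invented for illustration and are not taken from any of the studies mentioned above.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved items against the true set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: 10 real requirements; the tool returns 10 items, 6 of them correct.
relevant = {f"req-{i}" for i in range(10)}
retrieved = {f"req-{i}" for i in range(6)} | {f"noise-{i}" for i in range(4)}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.0%} recall={recall:.0%}")  # precision=60% recall=60%
```

Here four real requirements are missed (recall) and four of the ten results are noise (precision), which is the situation described above.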

Unfortunately, I wonder if we haven’t hit on the fundamental limit to the natural language parsing/machine learning toolkits (in this domain). I say this because the most obvious successes of machine learning, such as spell-checking or flu tracking, are due at least as much to the vast amount of data involved as to the novelty of the techniques themselves. The problem is that in software projects, the amount of information is measured in the tens of thousands of data points, when it really should be millions or tens of millions to be really successful.

For example, one of the longest-lived open projects, Mozilla, only has about 800,000 issues in its issue-tracker. Which is not bad, but we need to remove duplicates and bug-reports to find the ‘new features’ or ‘requirements’ – making it more like a few thousand. And it seems unlikely that commercial sources would have orders of magnitude more to offer. This is like training a spell-checker using a corpus of ten thousand English sentences—it will perform terribly. For example, on the Google Flu Trends link above, you can see that for the Canadian provinces/territories of PEI, NWT and Nunavut, there is not enough data (due to small populations) to make any predictions.

Getting more data isn’t obvious, either. We might aggregate the projects together, but each individual project tends to be very different in how it uses language, so essentially it is like configuring ten different languages for your spell-checker, each with ten thousand sentences. Impossible!

So assuming automating requirements extraction from project data is useful, how might we do this? I think there are two approaches. One, we downgrade our definition of ‘automation’ to allow for human judgement. This might mean asking developers to define a common project lexicon, or to be more diligent about annotating requirements automatically (many projects already separate these). However, asking developers to do things other than their core activities (writing code!) is usually doomed to failure unless it is very painless.

I think the other approach is to move past the bag of words model. In most statistical learners, you throw documents or sentences into a huge corpus. This works great for standard information retrieval examples like the Reuters corpus. But in software projects, it feels like by doing this we are losing a lot of the metadata, like dates or people, that might be relevant. Perhaps if we somehow annotate the training data with this information, and feed that into a learner, we would have more success.
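As a rough illustration of what “annotating the training data” could look like, the sketch below contrasts a plain bag-of-words representation with one that also carries author and date features. The function names and the feature encoding (`author=…`, `year=…`) are assumptions made up for this example, and whether such features actually help a learner is exactly the open question.

```python
from collections import Counter
from datetime import date


def bag_of_words(text):
    """Plain bag of words: token counts only, all context discarded."""
    return Counter(text.lower().split())


def annotated_features(text, author, created):
    """Bag of words plus metadata encoded as extra (hypothetical) features."""
    features = bag_of_words(text)
    features[f"author={author}"] = 1
    features[f"year={created.year}"] = 1
    return features


msg = "Add export to PDF"
print(bag_of_words(msg))
print(annotated_features(msg, "alice", date(2012, 3, 1)))
```

The second representation can be fed to the same kind of learner as the first; the only change is that who wrote the message and when are no longer thrown away.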


  1. I may be ball-parking these numbers!

RE-KOMBINE: Paraconsistent Reasoning for Requirements

Here I try to summarize the work and motivation for my dissertation, or parts thereof. One of the bigger components of my thesis is RE-KOMBINE, which is a reasoning engine for requirements models. You can find the code and examples here.

A central problem in requirements engineering (RE) is to understand what to build next. Requirements models are representations of what your system needs to accomplish. They encompass both that which exists, the current payroll system, for example, and what needs to be created—the interface between the payroll system and the point-of-sale system, for example. Like any model, requirements models are imperfect copies of the real world, with the degree/extent of imperfection reflecting cognitive support characteristics. For example, some of the most popular requirements models are informal lists, kept on paper or in a spreadsheet.

My dissertation research was about better understanding what a good model of evolving requirements ought to do. And I’ve come to the conclusion (somewhat diametrically opposed to my beliefs at the start of my Ph.D.) that formality is not only NOT harmful, it is under-utilized.

Formality considered essential

One’s representation of a problem is either informal or formal. I don’t understand people who say they are ‘semi-formal’. That’s like being a little pregnant. Formality means very specific things in this domain. Furthermore, if we want to do anything with a machine like a computer, we cannot give it inconsistent instructions. Most of the machines we’ve built do very badly with inconsistent information (although we are working on that). So even though you may claim to have a ‘semi-formal’ language, at some point it will assuredly be formal. It’s just that the process of translation may be found at a different point in the chain. E.g.

problem domain → req. elicitation → informal model → Rational RequisitePro/Blueprint → knowledge/software engineer (makes many decisions that were ‘implicit’) → formal representation (code, typically)

VS.

problem domain → elicitation → formal model → translation engine (human/automated) → source code

But what do we mean by ‘formal’?

Opponents of formalization often characterize it as unwieldy, non-scalable, confusing for end-users, etc. Is it difficult for humans to represent things in a formal system? Sure.
But as this book says, “the power of formalization is that, once formalized, an area of interest can be worked in without understanding” (Reeves, S., & Clarke, M. (2003). Logic for computer science. Addison Wesley. Retrieved from http://www.cs.waikato.ac.nz/~stever/LCS.html).

This makes formalization particularly important for sharing with others, or for using computer programs. Is reading a formalization challenging? Yes. There is much abuse of notation in many formal research papers, and even worse, the demand formality places on clarity of presentation necessitates an unwieldy presentation (you cannot just wave your hands and ‘claim’ something is true). But that is also the beauty of it – once you understand the formalization (for example, propositional logic), it becomes impossible (or rather, very difficult) to hide woolly thinking. Is formalization subject to scalability challenges? Sure. If a problem is inherently NP-hard, formalization won’t remove that. But neither will being informal. I actually think titles like “formal methods” or “formalization” are not helpful. They have so much baggage that the terms have become pejorative. Also, if you are a researcher in software, and don’t understand propositional logic, lambda calculi, complexity theory, etc., shame on you. Even if you work on human-computer interaction, these topics will make you a better researcher.

Techne (τέχνη)

We have been working on another requirements modeling language, Techne (see this paper). Why another language? We have plenty of requirements languages: KAOS, Problem Frames, i*, Tropos, UML (sort of) … In Techne we tried to write a language that was minimal and captured the essence of the requirements problem: given your requirements, find the implementation choices that will satisfy them (originally proposed by Zave and Jackson). We are working on ways in which Techne can be extended with concepts for more expressive models, using actors, numeric weights, and so on. But the initial language is merely requirements, tasks, and constraints (domain assumptions).

Requirements and tasks (nodes) are encoded as propositions, and the only formalization necessary is the relationships between propositions. These come in two flavours. The first is implications, which capture the notion that (for instance) implementing task T will satisfy requirement R. We didn’t want to have propagation of ‘partial’ anything, like other languages, because it is unclear what partial satisfaction means. Instead, the second flavour is conflict relationships, which represent the situation where doing one thing means another cannot be done, and which we write as D ∧ E → ⊥ (read as: satisfying requirements D and E together is inconsistent).

Why is this stuff useful? Techne makes two main contributions. One is to present a formal requirements modeling language with very well-understood proof theories and theorem provers (using Horn logic). It is decidable and polynomial to find inconsistencies (not the case in full SAT). The second contribution is the addition of the following notions: mandatory and optional states for nodes; preferences between minimal solutions in a set of requirements; and approximations of softgoals using quality constraints.

Node labels

Typically in requirements models we want to associate with each element a label reflecting the satisfiability of that node (hence the use of SAT solvers). We might monkey around with this notion, for example by using four-valued logic to handle ‘partial satisfaction’ (e.g. Sebastiani et al.).

Some nodes, certainly domain assumptions, and possibly tasks we have chosen, will start with labels (let’s call them T or F for now, although the mapping from these letters to the notions of Truth or Falsehood is not straightforward). The outcome of ‘evaluating’ a goal model (or other requirements model) is a labeled model, which tells the analyst which elements can be satisfied (hopefully the high level requirements!). We can also call this a ‘solution’ to the requirements problem – it tells us which elements will satisfy our optative properties.

Types of reasoning

I’m primarily considering logical reasoning; there are other pseudo-logical algorithms available. I am trying to collect these algorithms on GitHub. I’d love pull requests.

All requirements languages rely on consistent models. That is, if an inconsistency is found (that bottom, ⊥, can be derived), the entire model is trivialized; the inconsistency must be removed.

Two main approaches are forward and backward reasoning. In forward reasoning we start with a set of ‘facts’, and try to determine what ‘goals’ we can fulfill. Expert systems work like this, too. Considerations: can you support cyclic graphs? Does the algorithm terminate? Is the algorithm scalable?
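A minimal sketch of forward reasoning over Horn-style rules follows. The node names (`T1`, `R1`, `G`) and the rule encoding as `(premises, conclusion)` pairs are invented for illustration and are not how any particular tool represents its models.

```python
BOTTOM = "⊥"


def forward_chain(facts, rules):
    """Derive everything that follows from `facts` under Horn `rules`.

    Each rule is (premises, conclusion): all premises must hold to derive the conclusion.
    """
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and premises <= derived:
                derived.add(conclusion)
                changed = True
    return derived


rules = [
    (frozenset({"T1"}), "R1"),           # task T1 satisfies requirement R1
    (frozenset({"T2"}), "R2"),
    (frozenset({"R1"}), "G"),
    (frozenset({"G"}), "R1"),            # a cycle between G and R1
    (frozenset({"T1", "T2"}), BOTTOM),   # T1 and T2 conflict
]

print(sorted(forward_chain({"T1"}, rules)))          # ['G', 'R1', 'T1']
print(BOTTOM in forward_chain({"T1", "T2"}, rules))  # True
```

On the three considerations: the cycle is harmless because a node is only ever added once; the loop terminates because the derived set can only grow and is bounded by the number of nodes; and this naive version makes at most one pass over the rules per newly derived node, so it stays polynomial.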

Backward reasoning starts with a goal, or goals, and works to find the facts that make those goals true. This is how Prolog works: resolution proofs. I’m not aware of any requirements languages that support backward chaining; Datalog might be an example though, and if we generalize requirements models as KR systems, there is a lot of work here.
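For comparison, here is a small backward-chaining sketch over the same kind of propositional rules. It is far simpler than Prolog's resolution (no variables, no unification), and the rule set is again hypothetical.

```python
def prove(goal, facts, rules, visiting=frozenset()):
    """Return True if `goal` follows from `facts` via Horn `rules`, working backward."""
    if goal in facts:
        return True
    if goal in visiting:  # already trying to prove this goal on this path: avoid looping
        return False
    visiting = visiting | {goal}
    return any(
        all(prove(p, facts, rules, visiting) for p in premises)
        for premises, conclusion in rules
        if conclusion == goal
    )


rules = [
    (("T1",), "R1"),
    (("T2",), "R2"),
    (("R1", "R2"), "G"),   # G needs both R1 and R2
    (("G",), "R1"),        # a cycle, handled by the `visiting` check
]

print(prove("G", {"T1", "T2"}, rules))  # True
print(prove("G", {"T1"}, rules))        # False
```

The search only touches rules that could contribute to the goal, which is the practical appeal of working backward when the model is large and the question is narrow.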

A final way to reason about goal models is to try to ‘label’ the graphs consistently. I don’t hold this to be either backward or forward reasoning: instead, you are just brute-forcing the problem into a set of conjunctive normal form (CNF) formulas, and then trying to satisfy the resulting wff. This problem is NP-complete, but SAT solvers have advanced to the point where most requirements problems should be readily solvable. The nice thing about the CNF representation is that there are a variety of twists on the boolean satisfiability problem, such as Weighted SAT, MaxSAT, MinCostSAT, etc.
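The labeling view can be sketched by enumerating every assignment and keeping those that satisfy the CNF clauses. This brute-force loop is exponential and only stands in for a real SAT solver; the four-node model is made up for the example.

```python
from itertools import product


def satisfying_labelings(variables, clauses):
    """Yield every True/False labeling that satisfies all CNF clauses.

    A clause is a list of (variable, polarity) literals, and is satisfied when
    at least one variable carries the stated polarity.
    """
    for values in product([True, False], repeat=len(variables)):
        labeling = dict(zip(variables, values))
        if all(any(labeling[v] == pol for v, pol in clause) for clause in clauses):
            yield labeling


variables = ["T", "R", "D", "E"]
clauses = [
    [("T", False), ("R", True)],   # T → R, i.e. ¬T ∨ R
    [("D", False), ("E", False)],  # D ∧ E → ⊥, i.e. ¬D ∨ ¬E
    [("R", True)],                 # R must be satisfied
]

for labeling in satisfying_labelings(variables, clauses):
    print(labeling)
```

Each printed dictionary is one consistent labeling of the model. Variants such as MaxSAT would rank these labelings rather than merely enumerate them.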

RE-KOMBINE

To extend these approaches, we noted that it is often desirable to support paraconsistency, that is, tolerating inconsistency. There are at least four reasons for allowing inconsistent statements and working around them (after Nuseibeh et al.):

  • to facilitate distributed collaborative working,
  • to prevent premature commitment to design decisions,
  • to ensure all stakeholder views are taken into account,
  • to focus attention on problem areas [of the specification].

RE-KOMBINE is the name of the tool I wrote to support paraconsistent reasoning over Techne models. You can view a presentation which summarizes it on Slideshare, or read the CAiSE 2012 paper for more in-depth discussion.

If we continue with the Techne notion that we should have propositions which are either requirements, tasks, or domain assumptions, and then allow for refinements and conflict relations between them, then our paraconsistent approach simply says that we credulously accept minimal solutions which entail the desired requirements. That is, we are merely looking for a subset of the tasks which satisfy our requirements.

This is nice, because it means that even if there is a possible conflict between two requirements, or between a domain assumption (like “Don’t Use WEP”) and a requirement (“Use WIFI for remote terminals”) we can ignore that conflict as long as there is a ‘workaround’ solution. We like this, because it means we can be more flexible (agile) by looking for the immediately implementable solution, and worry later about how we might actually make the conflict disappear.

The only constraints we impose are that our domain assumptions are internally consistent, and that the requirements we are seeking to satisfy are also consistent with each other (if this isn’t the case, then presumably the operator is confused).
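The idea of accepting a workaround despite a conflict elsewhere in the model can be sketched as a search for minimal task subsets. This is an illustration only, not the algorithm RE-KOMBINE uses: the exhaustive subset search, the requirement that a chosen subset not derive ⊥ itself, and the `use_wpa2` task are all assumptions introduced for the example.

```python
from itertools import combinations

BOTTOM = "⊥"


def closure(facts, rules):
    """Forward-chain Horn rules of the form (premises, conclusion) to a fixed point."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in derived and premises <= derived:
                derived.add(conclusion)
                changed = True
    return derived


def minimal_solutions(tasks, assumptions, requirements, rules):
    """Smallest-first search for task subsets that entail the requirements without deriving ⊥."""
    solutions = []
    for size in range(len(tasks) + 1):
        for subset in combinations(sorted(tasks), size):
            chosen = set(subset)
            if any(s <= chosen for s in solutions):
                continue  # a smaller solution is already contained in this subset
            derived = closure(chosen | assumptions, rules)
            if requirements <= derived and BOTTOM not in derived:
                solutions.append(chosen)
    return solutions


rules = [
    (frozenset({"use_wep"}), "wifi_terminals"),
    (frozenset({"use_wpa2"}), "wifi_terminals"),   # hypothetical workaround task
    (frozenset({"use_wep", "no_wep"}), BOTTOM),    # conflicts with the domain assumption
]
print(minimal_solutions({"use_wep", "use_wpa2"}, {"no_wep"}, {"wifi_terminals"}, rules))
# [{'use_wpa2'}]
```

The model as a whole still contains a conflict (choosing `use_wep` derives ⊥), but the search simply steps around it, which is the tolerant behaviour described above. The subset enumeration is exponential in the number of tasks, so this is only viable for toy models.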

We used RE-KOMBINE on the Payment Card Industry case study (PCI-DSS) as a proof-of-concept. Our next focus is to make this tool integrate with existing requirements management and work tracking tools, in order to seamlessly fit into existing workflows.

It’s possible this post was too long.

ICSE 2012 Thoughts (1): Saskia Sassen Keynote

I attended ICSE 2012 in Zürich in early June. I have a few notes I’ll share in dribs and drabs over the next little while. (ed: see also Eric Knauss’s summary).

Saskia Sassen gave the opening keynote, which in my mind should set the scene for the conference as a whole, coming as it does on the very first day. I don’t think it was well-received; most people found her talk obscurantist, heavily draped in the overwrought language of the liberal arts. Which is sad: having some exposure to that way of writing and thinking (thanks Geography 101: Human Geography!) I could follow with some effort (but pity those who don’t speak English natively). Consider: she used the term “softwaring” as a verb. The other critique I direct her way is that she committed the intellectually lazy sin of assuming her audience is all uniformly logical positivists (I gather because of the “engineering” term in the conference title). Those folks are definitely a majority, but a substantial number of researchers at ICSE are exploring the human side of software engineering, and are comfortable discussing different epistemologies and methodologies.

Sassen has some interesting experiences and ideas underneath her external posturing, and I think they are quite relevant to software engineering. One thing I picked out was her insistence that not everyone was ‘online’ or even wanted to be online, and we ignore them at our peril. And rather than wiring a community and stepping back to see what happens, why not see what is important first? She made reference to a phenomenon she called “barefoot engineers”, people who, post-Communism, set up rudimentary technologies like utilities outside of the traditional structures. She made the point that competence rules in the tech world, but that context of use is equally important.

I think this has clear parallels in the ethnography of software development: we tend to focus on the software developers, and I think (aside from some researchers like Bonnie Nardi, Susan Sim, my friend Jorge Aranda), ignore the contexts and communities of use of those technologies. Who are the facilitators? Who translates the knowledge? Does someone like RMS or Linus really operate in complete isolation? Wherefore all these data mining tools, anyway?

She described some of her research initiatives, too: one where 3 researchers were killed in the Amazon (I think, my memory is hazy). Which raises the question of impact of SE: not that we want people to die doing it, but presumably those deaths are for some good and noble purpose. Can we say the same about SE research?

Once you got past the language barrier, I think the keynote was excellent: she made one question the nature of one’s work, which I think is what you ask for in a keynote. A pity no one is able to force her to do a dry-run first, though.

Using iCloud and Cisco VPN

UBC uses Cisco’s VPN solution for accessing the network off-campus [1]. For a long time I skipped using it because 1) it conflicted with iCloud and 2) it required a constantly active app icon in the Dock. Well, 1) is partially resolved, so I can again use it, begrudgingly.

The problem was that if you had iCloud active (using the preference pane), when the Cisco “myVPN” client started it would not find the VPN server. It turns out that this is due to only two iCloud services, Back to My Mac and Find My Mac, which you can disable. Once those are unchecked (see image), the Cisco client should be able to detect the server and get online.

[Image: iCloud preference pane with Back to My Mac and Find My Mac unchecked]

Since I only use iCloud for data and calendaring right now, this is acceptable, although I suppose losing my laptop might make me change my mind. You can check out this Apple support discussion for more lurid details.

1. Actually, you need to VPN if you are accessing printers or local directories on-campus as well.

Cynefin and MDE

Watching Dave Snowden talk about Cynefin (thanks to @dsg22 for the reminder), it occurred to me that this sense-making approach might explain some of my discomfort with model-driven engineering (MDE). I will readily admit to bias: we built a research project, OpenOME, using model-driven engineering with GMF, and in my opinion, it was completely the wrong approach for the job (keeping in mind that MDE pays off in subsequent projects).

I think my conceptual problem is that MDE encodes a specific process in a tool. At least one central selling point is that the models can be independent of the context in which they are used. That is, we have this process for designing graphical editors in Eclipse (with a hard-coded architectural approach), and we have encoded this into the GMF framework.

Cynefin framework (via Wikipedia)

It isn’t that this can never be useful. But if we turn to Cynefin, we can see why it might be problematic. I would characterize MDE as working well in “complicated” or “simple” domains, where the appropriate response to a problem is to use best practices or ‘good’ practices – exactly what MDE is doing. However, much of the work we do in software development is in the complex, if not chaotic domains, where best practices simply don’t exist. I would characterize my experience with GMF this way. The tool we built had unclear requirements, and we didn’t understand the technological landscape properly. A better approach, if we go by Cynefin, would be to “probe” the problem by building prototypes, and then determine what our response to the experiment would be.

In other words, if you are in a simple or complicated domain, then MDE tools should work great. They will allow you to deploy your app to multiple platforms, greatly reduce re-engineering effort, and much else. But if you don’t understand the domain very well, or you are in a chaotic system, you should put the MDE aside. And it’s my opinion that most software development occurs in the complex domain.

Using GitHub for 3rd Year Software Engineering

This past semester (Winter 2012), I was the instructor for UBC’s CPSC 310: Introduction to Software Engineering. As part of the course, students must complete a large-scale software project in teams of 4–5 in 2 months. This term, I allowed some teams to use GitHub to manage the project.

Reasons

At UBC, we have for some time used IBM’s Rational Team Concert tool, which is free for academic use, as our software collaboration environment. This was the default tool for this term, as well, save for the three groups who applied to use GitHub. The University of Victoria has been using RTC for a similar purpose.

RTC is, I’m sure, an excellent product for its intended audience, namely, professional software development teams. It is easy to install and maintain for the technical support staff here, has sufficient documentation, and clients for Mac, Linux and Windows. It is also free for us to use as part of the IBM Academic program. However, in course evaluations, it has been the single most complained-about part of the course. It is cumbersome to install for students, and most importantly, always seems to be out-of-date with other software. In our case, RTC 3 is built using Eclipse 3.4, which is “somewhat” incompatible with the libraries and plugins I was looking to use, chiefly the Google Plugin for Eclipse. A significant amount of course time (TA and instructor office hours) was spent getting RTC to work with the other software for the course. And I am just seeing that IBM is now working on RTC 4, which implies more trouble in the future.

Now, this is partly because students have not had much experience installing commercial development tools on their machines, and that is certainly a learning objective for this course (I am confident it is a pain point for nearly all software developers, new or otherwise). It is also because students run a bewildering array of operating systems and language profiles on their laptops and desktop computers, which makes support a headache.

That being said, my sense was that RTC was simply too much tool for what the students needed. As Greg Wilson’s DrProject experiment showed, students simply do not have time, nor inclination, to leverage the more powerful collaboration aspects. Filling out work items, creating documentation, even committing code is something they just do not see the need for. To get them to try it, we must assign marks to those activities. RTC’s terminology (streams, components, etc.) is probably great for a developer with multiple projects: for these students (and me!) it is inconsistent with most version control concepts and confusing to use.

Since I’ve personally used GitHub for a while, and git seems to have a lot of developer mindshare, it seemed like a good fit for an experiment.

Experiences

Instructor

I emailed GitHub about the use of their web app for education and received a very prompt affirmative. GitHub will provide an organization account to the instructor, which includes private repositories for up to 200 people. At the end of the semester, they then require you to either delete the repositories (on GitHub, obviously not locally) or make them public (free accounts).

The GitHub UI is generally simple, but some of the navigation options are confusing from the team manager point of view (me!). Tracking student performance is pretty easy, since you have ready access to the excellent GitHub website, including issue tracking, change set tracking, pull requests, etc. GitHub has excellent graphs that allow the instructor team to check who is doing what. There doesn’t appear to be a good way to email all of the students in the various teams at the same time (we are an “organization” made up of 4–5 person teams).

Git itself is the main reason to use GitHub. There are vastly superior tutorials on it, and I like the pure distributed model better than what RTC provides. Finally, and perhaps most important, as the instructor I’ve used Git a fair bit and RTC very little. The disadvantage is that there is no central repository for backup purposes, although with a distributed model this is presumably less important.

Student

Students were generally positive. The alternative, in this course, was to use RTC. Git has the advantage of more widespread adoption (you’re unlikely to use RTC at Microsoft, but MSFT supports Git in various places). On the downside, if GitHub goes down, students can no longer manage issues. Git itself is also complex to learn; I should have provided a short tutorial to those teams on the basics of Git.

They also found tools like SourceTree and Tortoise invaluable in understanding what was happening with branches and remotes. For a while, a few teams had multiple, non-merged and conflicting branches for each member, which they could resolve once they saw visually how the branches were happening. The concept of remote repositories and pull requests is a little alien at first.

Issue tracking in GitHub is primitive relative to RTC. This is a strength for this course, I feel, but when we went from user stories to tasks it meant students had to roll their own classification scheme (e.g., define a product backlog item, then the tasks which compose it).

The teams which used GitHub were much stronger than most other teams in the course, so the results are no doubt skewed. That being said, I don’t think RTC was any simpler to use: in a number of teams, at least one team member never managed to commit code to the shared repository.

Looking ahead

The obvious question is, “Would you use GitHub again?” The answer is yes, and perhaps even “I would like to make everyone use it.”

It was confusing to have two separate tools in the course. Partly, this is because marking is complicated by the fact that student teams are in tutorial sections, so some teams in a given tutorial were using GitHub and some RTC. This meant TAs had to mark both tools (and learn both). Exam questions are more complicated, since you must account for some students never having used RTC, if you ask about issue tracking.

I like the fact that the project collaboration tool was separate from the IDE. I think RTC’s tight Eclipse integration makes it difficult to install the IDE necessary. Some students ran Eclipse 3.7 in conjunction with RTC (Eclipse 3.4) in order to get plugins working. Since git is so popular, it is much easier to find tool support for it than to munge RTC into your workflow. In future, tools like Mylyn would be useful to better integrate issue tracking into the IDE.

The big outstanding issue is privacy. In BC, the provincial government is considering laws that prohibit (or seem to, IANAL) student data being anywhere near a US server (despite students happily sending email about their marks from Hotmail or Gmail). While I respect that motivation, I feel there should be some way to give consent, particularly when so many excellent tools are US-based.

DeMarco and “Cannot control what you cannot measure”

Tom DeMarco’s influential 1982 book “Controlling Software Projects: Management, Measurement, and Estimation” famously started with the aphorism that you “cannot control what you cannot measure”, in arguing for more metrics in software development. There is pushback against this notion, however, and DeMarco himself (pdf) has said he thinks it was written naively: “a curious combination of generally true things written on every page but combined into an overall message that’s wrong”.

The critique against management as measurement (e.g., Jorge Aranda’s) stems from a belief that measurement implies a Taylorist approach to software engineering that is simply not applicable. Software construction is not widget assembly, and even in industrial assembly lines the notion of human workers as cogs has changed dramatically, most obviously as encapsulated by the set of philosophies called the “Toyota Way”, and now commonly called “lean manufacturing”.

The idea that a software team lead could sit in a corner office, enveloped by dashboards showing lines of code added per day, bugs fixed, etc. does seem silly. Clearly, a better approach would be to understand the team dynamics and their struggles – the role that the Scrum Master plays, for example. However, this isn’t to say that metrics aren’t supremely important. The lesson of big data and Google-style machine learning is that there are often messages hidden in the data that cannot be surfaced with individual, anecdotal experience. Consider a simple indicator like “lead time” (the time from a work item being created/accepted to it being closed/released). It can be very difficult to get a good sense for this in the aggregate, since developers spend so much time on individual items. It is only looking back on a cumulative flow diagram that spikes in lead time can be detected, and bottlenecks resolved.
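As a small worked example of the lead-time indicator, the sketch below computes lead time per work item and flags outliers. The work items are invented, and the “more than twice the median” threshold for a spike is an arbitrary assumption chosen for illustration, not a Kanban rule.

```python
from datetime import date
from statistics import median

# Hypothetical work items: (id, created, closed)
items = [
    ("A", date(2012, 5, 1), date(2012, 5, 4)),
    ("B", date(2012, 5, 2), date(2012, 5, 5)),
    ("C", date(2012, 5, 3), date(2012, 5, 17)),
]


def lead_times(work_items):
    """Map each item id to its lead time in days (closed minus created)."""
    return {item_id: (closed - created).days for item_id, created, closed in work_items}


times = lead_times(items)
typical = median(times.values())
# Assumed rule of thumb: anything over twice the median counts as a spike.
spikes = [item_id for item_id, days in times.items() if days > 2 * typical]

print(times)    # {'A': 3, 'B': 3, 'C': 14}
print(typical)  # 3
print(spikes)   # ['C']
```

Nobody working on item C day to day would necessarily notice it was an outlier; it only stands out once the items are viewed together.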

I think the difference from Taylorism is that modern metrics are seen as complementary to hands-on, qualitative management, or as pieces of a software process improvement approach like Kanban. In Kanban, two key areas where this occurs are the principles of “Visualizing workflow”, often using a whiteboard with swim lanes, and “Managing Flow”, which uses lead-time and work-in-progress measures. The key point, though, is that these are two of the five core principles, the others being “limit work in progress”, “make process policies explicit”, and “improve collaboratively”. You don’t manage on metrics alone [1]. The metrics drive the self-improvement process but cannot replace in-depth understanding. With the short delivery cycles that Scrum or XP advocate, this is much easier to do, because the ability to tweak the process occurs much more rapidly than in longer-iteration approaches, as Jim Highsmith pointed out.

1. Let us stipulate that there are endless examples of low-maturity teams out there whom no technique will help.