The Marginal Utility of Testing/Refactoring/Thinking


January 21, 2016

Andy Zaidman had an interesting presentation about test analytics. The takeaway for me was that a) people overestimate their unit test engineering (estimate: 50%, reality, 25%). But b) the real issue is convincing a developer that this unit test will improve the quality of the code. In other words, like with technical debt, or refactoring, or commenting, the marginal utility of adding a test is perceived to be low (and of course the cost is seen as high). Each new individual test adds nothing to the immediate benefit (with some exceptions if one is following strict TDD). And yet each one requires switching from the mental model of the program to the one of Junit frameworks and test harnesses.

The issue is not whether testing is good or bad, but rather, which testing is most useful. It seems unlikely to me that the value of individual tests is normally distributed but rather power-law form (i.e., that there are a very few extremely high value tests). And this isn’t just about testing; indeed, most activities with delayed payoff—refactoring, documenting, architecting—likely exhibit the same problem. It is hard to convince people to invest in such activities without giving them concrete proof it is valuable. You just have to look at the default examples for Cucumber, for instance, to see that the vast majority are trivial and easily grasped without any of the tests. Similarly, “code smells are bad”, but bad might just mean they look nasty, while having little to do with the underlying effectiveness of the code. It isn’t technical debt if it never causes a problem. It isn’t a bug if it isn’t worth fixing it.

In new work we are starting with Tim Menzies, we are trying to understand the inflection point beyond which your decisions add little incremental value (i.e., stop adding more tests). The good news is this is easy to spot in hindsight; the challenge is to take those lessons and determine this before doing hours of pointless work. The direction we are taking is to try and capture the common patterns the key decisions share (in the testing example, perhaps this is bounds testing). Ultimately, we hope to provide advice to developers as to when the marginal utility falls below a threshold (i.e., stop testing!)

The other point is the over-reliance of software engineering on hoary folklore. Things like “some developers are 10x as productive”, or “80% of bugs occur in requirements”, tend to be statements that are derived from a single study, conducted in 1985, on 3 large scale defense projects, but have somehow made their way down the years to become canon. Ours is not the only field to suffer from this, of course. But when capable developers refuse to pay 200$ a year to join the IEEE Digital Library, it seems to demonstrate a firm commitment to ignorance.