The Marginal Utility of Testing/Refactoring/Thinking

Andy Zaidman gave an interesting presentation about test analytics. The takeaway for me was that (a) people overestimate how much of their effort goes into unit testing (estimate: 50%; reality: 25%), but (b) the real issue is convincing a developer that this particular unit test will improve the quality of the code. In other words, as with technical debt, refactoring, or commenting, the marginal utility of adding a test is perceived to be low (and of course the cost is seen as high). Each new individual test adds little or nothing to the immediate benefit (with some exceptions if one is following strict TDD), and yet each one requires switching from the mental model of the program to that of JUnit frameworks and test harnesses.

The issue is not whether testing is good or bad, but rather which testing is most useful. It seems unlikely to me that the value of individual tests is normally distributed; more likely it follows a power law (i.e., a very few tests are of extremely high value). And this is not just about testing; most activities with delayed payoff (refactoring, documenting, architecting) likely exhibit the same problem. It is hard to convince people to invest in such activities without concrete proof of their value. You just have to look at the default examples for Cucumber, for instance, to see that the vast majority are trivial and easily grasped without any of the tests. Similarly, "code smells are bad", but bad might just mean they look nasty, while having little to do with the underlying effectiveness of the code. It isn't technical debt if it never causes a problem. It isn't a bug if it isn't worth fixing.

In new work we are starting with Tim Menzies, we are trying to understand the inflection point beyond which your decisions add little incremental value (i.e., when to stop adding more tests). The good news is that this is easy to spot in hindsight; the challenge is to take those lessons and make the determination before doing hours of pointless work. The direction we are taking is to try to capture the common patterns the key decisions share (in the testing example, perhaps this is bounds testing). Ultimately, we hope to advise developers when the marginal utility falls below a threshold (i.e., stop testing!).
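To make "marginal utility" concrete, here is a rough sketch (entirely my own, not the project's method): if each test contributes some set of covered elements (branches, mutants killed, whatever value proxy you prefer), you can order tests greedily by incremental contribution and watch how quickly the gains flatten out.

    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.Set;

    public class MarginalValue {
        // Greedily pick the test with the largest incremental coverage, stopping
        // once the marginal gain drops below a (hypothetical) threshold.
        static void report(Map<String, Set<String>> coverageByTest, int minGain) {
            Set<String> covered = new HashSet<>();
            Map<String, Set<String>> remaining = new LinkedHashMap<>(coverageByTest);
            while (!remaining.isEmpty()) {
                String best = null;
                int bestGain = -1;
                for (Map.Entry<String, Set<String>> e : remaining.entrySet()) {
                    Set<String> gain = new HashSet<>(e.getValue());
                    gain.removeAll(covered);
                    if (gain.size() > bestGain) {
                        bestGain = gain.size();
                        best = e.getKey();
                    }
                }
                if (bestGain < minGain) {
                    System.out.println("Marginal gain below threshold; the remaining tests add little.");
                    return;
                }
                covered.addAll(remaining.remove(best));
                System.out.println(best + " adds " + bestGain + " newly covered elements");
            }
        }

        public static void main(String[] args) {
            Map<String, Set<String>> cov = new LinkedHashMap<>();
            cov.put("testParse", new HashSet<>(Arrays.asList("b1", "b2", "b3")));
            cov.put("testParseEmpty", new HashSet<>(Arrays.asList("b1")));
            cov.put("testFormat", new HashSet<>(Arrays.asList("b4")));
            report(cov, 1);
        }
    }

Coverage is only one (imperfect) proxy for value, of course; the point is the shape of the curve, not the particular measure.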

The other point is the over-reliance of software engineering on hoary folklore. Things like “some developers are 10x as productive”, or “80% of bugs occur in requirements”, tend to be statements that are derived from a single study, conducted in 1985, on 3 large scale defense projects, but have somehow made their way down the years to become canon. Ours is not the only field to suffer from this, of course. But when capable developers refuse to pay 200$ a year to join the IEEE Digital Library, it seems to demonstrate a firm commitment to ignorance.


A Model of Software Quality Checks

Software quality can be automatically checked by tools like SonarQube, CAST, FindBugs, Coverity, etc. But these tools often encompass several different classes of quality checks. I propose the following hierarchy to organize these rules.

Level 0: Syntax quality

Focus: code that ‘runs’.

At Level 0, a compiler or interpreter's components (parsers, lexers, intermediate forms) assess syntactic correctness. It is Level 0 because (clearly) without proper syntax nothing else gets done.

Level 1: Lint-free

Focus: code that avoids obvious sources of problems.

The code compiles without warnings even when all possible compiler flags are turned on. These warnings tend to be close to syntax in their complexity. For example, a fall-through switch statement is technically legal in Java, but the -Xlint:fallthrough flag will catch it. IDEs such as Eclipse will often flag these automatically with warning icons.
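As a minimal illustration (my own example, not taken from any tool's documentation), this is the kind of construct that compiling with javac -Xlint:fallthrough warns about:

    public class FallThroughExample {
        static String describe(int code) {
            String result = "";
            switch (code) {
                case 1:
                    result = "one";
                    // Missing break: control falls through into the next case,
                    // which javac -Xlint:fallthrough reports as a warning.
                case 2:
                    result = "two";
                    break;
                default:
                    result = "other";
            }
            return result;
        }

        public static void main(String[] args) {
            System.out.println(describe(1)); // prints "two", probably not what was intended
        }
    }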

Level 2: Good code

Focus: code that conforms to commonly accepted best practices for its language.

For Java, e.g., visibility modifiers are suitable; for C, there are no buffer overflows and memory is released appropriately. Some practices cut across languages: documentation, the existence of unit tests, and so on. Many of the quality analysis tools, like FindBugs, operate at this level. CWEs are another example. I also place dependency analysis approaches here (perhaps controversially), although dependency analysis also pops up in the next level (e.g., properly using interfaces in Java).
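A minimal sketch of the kind of issue flagged at this level (the class and field names are mine, purely for illustration; FindBugs-style warnings about public mutable static fields are the inspiration):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class VisibilityExample {
        // Questionable: a public, mutable static field lets any class rewrite shared state.
        public static List<String> allowedUsers = new ArrayList<>();

        // Better: keep the field private and expose a read-only view.
        private static final List<String> ALLOWED_USERS = new ArrayList<>();

        public static List<String> getAllowedUsers() {
            return Collections.unmodifiableList(ALLOWED_USERS);
        }
    }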

Level 3: Paradigmatic

Focus: writing code that is maintainable, understandable, and performant with respect to its runtime environment.

Would someone writing object-oriented, functional, embedded, etc. code consider this reasonable? This level includes principles like SOLID, avoiding side effects in functional code, managing memory appropriately, and distributed code that shows awareness of the fundamentals of distributed computing. It also includes proper use of language idioms: JavaScript callbacks, Ruby blocks, and so on. We might classify new language features here as well (the diamond operator for generics in Java 7 comes to mind). Essentially, if you did a peer review with a language guru (Odersky for Scala, say), would they have a 'better way' to do it? (Perl notwithstanding…)
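For instance (a sketch of my own, not drawn from any particular style guide), the same collection code written in a raw, pre-generics style versus the idiomatic Java 7 style:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class ParadigmExample {
        // Raw types compile, but force casts on callers and invite ClassCastException.
        static List namesRaw() {
            List names = new ArrayList();
            names.add("Ada");
            return names;
        }

        // Typed, with the Java 7 diamond operator inferring the type arguments.
        static Map<String, List<String>> namesByTeam() {
            Map<String, List<String>> byTeam = new HashMap<>();
            List<String> compilers = new ArrayList<>();
            compilers.add("Ada");
            byTeam.put("compilers", compilers);
            return byTeam;
        }
    }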

Level 4: Well-designed

Focus: building systems that respect the appropriate usage scenarios (as known at the time).

Given the knowledge available, the code is architecturally appropriate for the applicable quality attribute requirements (QARs): e.g., modular, performant, secure. The key here is understanding which QARs are relevant. Examples of checks at this level include reflexion models (like ArchJava), conformance checking (e.g., Dicto), and library analysis (e.g., for license issues or currency).
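As a toy sketch of what an automated conformance check might look like (the package names and the layering rule are invented for illustration; real tools such as Dicto are far richer), one could scan source files for imports that violate an architectural rule:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class LayeringCheck {
        // Invented rule: UI code must not reach directly into the persistence layer.
        private static final String FORBIDDEN = "import com.example.persistence";

        public static void main(String[] args) throws IOException {
            Path uiRoot = Paths.get(args.length > 0 ? args[0] : "src/com/example/ui");
            try (Stream<Path> files = Files.walk(uiRoot)) {
                files.filter(p -> p.toString().endsWith(".java"))
                     .forEach(LayeringCheck::check);
            }
        }

        private static void check(Path file) {
            try {
                for (String line : Files.readAllLines(file)) {
                    if (line.trim().startsWith(FORBIDDEN)) {
                        System.out.println(file + ": violates layering rule: " + line.trim());
                    }
                }
            } catch (IOException e) {
                System.err.println("Could not read " + file);
            }
        }
    }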

Outcome

A few things become clearer when we view software quality with this approach.

First, I think that quality checks become more useful as you move ‘up’ (0→4) in the hierarchy. That is, I’d rather know that I have a serious design problem than a code quality problem.

Second, unfortunately, it seems much harder to design truly automated checks at the higher levels. This is why we have a lot of manual architecture analysis but leave code quality to tools.

Third, our rules get more context-specific as we move up the hierarchy. That is, in order to properly check paradigmaticness[1], I need to know your choice of programming language and possibly the properties of your problem domain. To properly do design validation, I need to know what qualities are important to you: performance? Availability? That, I think, is partly what makes these levels more useful.

Other hierarchies

The one I’m most familiar with is from Jean-Louis Letouzey. He proposed the SQALE quality model, and his central insight is that some qualities precede others: you must have maintainable code before having performant code, or testable code before secure code.

EDIT [1/6/16]: somehow I forgot this CAST diagram showing different levels of analysis, very similar to mine. They also claim that the ‘system level’ (my design level) is the place where architecture is checked.


  1. I’m not sure how to ‘noun’ this adjective … 


Requirements, Agile, and Finding Errors

It’s a long-held view in the requirements engineering (RE) community that “if only we could do RE better, software development would be cheaper”. Here ‘doing RE better’ means that your requirements document adheres to some quality standard such as IEEE 830; for example, none of the requirements is ambiguous.

One justification is that, based on (very few) studies from the late 80s, requirements errors cost far more to fix in test or production than at the point where they are introduced. For instance, if I tell a subcontractor she has a 100 kilobyte message size limit, and I really meant 100 kilobits, fixing that problem after she has delivered the subcomponent will be expensive. This seems obvious. But two problems emerge: 1) Why does she have to wait so long to integrate the subcomponent? 2) How many of these problems are there? Granted that it is cheaper to fix that particular error in the requirements/system engineering phase, how much money should we spend to find such errors at that point? [1]

An interesting early experiment on this is described in Davis, 1989, “Identification of errors in software requirements through use of automated requirements tools”, Information and Software Technology 31(9), pp. 472–476. In an example of an experiment we see very rarely these days, his team was given sufficient funds to apply three automated requirements quality tools to a large software requirements specification for the US Army (200,000 pages!). The tools were able to find several hundred errors in the spec, including errors of inconsistency. Yay, the tools worked! But…

The program had decided to go ahead and build its (COBOL) system before the automated analysis, and the developers on the program didn’t care much about the findings. Eighty of the 220 modules were not detectable in the final system (meaning, presumably, they were either merged or omitted altogether). Davis did some post-delivery follow-up, showing that the modules with more requirements problems had significantly more post-release defects. But whether the two are causally related is hard to say (those modules may simply be more complex in general, so both requirements and code are harder to get right).

What I conclude from this is that finding errors of the sort they did, e.g.,

PROBLEM: the referenced table directs that PART_NO be moved from the WORK_ORDER_FILE to the WORK_TASK_FILE. Available fields in the WORK_TASK_FILE include PART_NO_FIELD_PART and PART_NO_FIELD_TASK.

CHOICE: We assume that PART_NO_FIELD_TASK is the proper destination.

are ultimately of zero value to document. As a result, finding problems with them, automated or otherwise, is also of no value. Of course we know all this from the past 20 years of the agile movement, but it is interesting to see it in action. I think that (in 1989 certainly) this was excusable, as the program managers had no good sense of what made software special. The level of detail this design prescribes, down to field names and dependencies, is better suited to the Apollo program, where they specified how tightly to turn each bolt and labeled every individual bolt. That makes sense in a safety-critical, dynamic environment, but not a lot of sense in an office logistics tool.

Going Forward

A term I loathe, but it seems better than “Future Work”. I’ve worked a lot on automated requirements tools like PSL/PSA or SREM, so where should we head with automated tooling for requirements?

There is a lot of empirical evidence that simple, easily integrated process patterns such as requirements goals and scenarios lead to higher-quality requirements. Intel, for example, is a strong believer in training staff to write good requirements (although notice that its domain is hardware-oriented and mistakes are costly). Even in agile settings I believe there are big improvements to be gained from writing better user stories (e.g., how to create the “Magic Backlog” described in Rebecca Wirfs-Brock’s EuroPLoP 2015 paper).

Furthermore, we are seeing more and more use of machine learning to flag requirements problems. For example, Daimler has simple detectors for checking requirements. And at Rolls-Royce, based on simple training exercises, they label requirements by potential risk, combining uncertainty, change impact, and cost into an index. All of these types of tools integrate well into a developer analytics approach, able to populate dashboards and flag things unobtrusively (compare that with the cost of writing requirements formally).
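As a toy illustration of what a ‘simple detector’ might look like (the word list and the rule are my own invention, not Daimler’s), one could flag requirements that contain known ambiguity indicators:

    import java.util.Arrays;
    import java.util.List;

    public class WeakPhraseDetector {
        // Hypothetical list of ambiguity indicators; real detectors are curated or trained per domain.
        private static final List<String> WEAK_PHRASES =
                Arrays.asList("as appropriate", "if possible", "etc.", "user-friendly", "fast", "flexible");

        static boolean isSuspicious(String requirement) {
            String lower = requirement.toLowerCase();
            return WEAK_PHRASES.stream().anyMatch(lower::contains);
        }

        public static void main(String[] args) {
            String req = "The system shall respond to queries quickly and shall be user-friendly.";
            if (isSuspicious(req)) {
                System.out.println("Review for ambiguity: " + req);
            }
        }
    }

Checks like this are cheap enough to run on every change to the requirements repository, which is what makes the unobtrusive, dashboard-style integration plausible.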

As with any analytics technique, which ones to apply is situation-specific. Small companies doing the same things in well-understood domains won’t need much, if any, requirements analysis. I think there is a lot of room for intelligent augmentation around what makes a good requirement: tooling that facilitates conversations and the discovery of uncertainty, and that automates the repeated, boring tasks (if you cannot possibly avoid creating a 2,000-page document…). And in specialized domains, we are moving to a world where more and more of the analysis can be done in models, to verify timing requirements, guarantee that software partitions hold, and so on. Here the line between ‘requirement’ and ‘design solution’ is blurry, because requirements at one level become design solutions at the next. A mature requirements practice would leverage this to enable experimentation and prototyping in silico, as it were, finding design problems before releasing products or fabricating chips.

Finding Defect Leakage

A major goal for large programs is to reduce defect leakage, the number of bugs that make it to production (to put it more precisely, reduce the number of critical bugs that make it to production). It seems to me there are at least four complementary approaches to this issue:

  • We could do this manually, and insist on writing good requirements using checklists, training, inspection, etc.
  • We could use formal methods, on well-formed architectural models, looking for very specific rule violations (safety, security, performance).
  • We could apply machine learning tools to past artifacts and try to leverage experience to predict problems. Not every requirement is equally important (obvious, but not always followed).
  • We could design a process that accepts the inevitability of change and makes it not only possible but desirable to change design and requirements in response to new knowledge.

For the automated tools, I have this quick list of principles, much like those for software analytics in general:

  1. Don’t make life worse. Developers should not dread having to do this. That said, an ounce of pain is worth a pound of pleasure.
  2. Work with existing tools like DOORS, Jira, and Excel. Your Eclipse plugin does not count.
  3. Don’t mandate new or complex languages or tools for requirements. We can barely get engineers to write requirements in natural language as it is.
  4. Prefer lightweight, high value checks over complex, theoretically appealing ones. Socialize people to the value of checking anything before insisting on the complex stuff.
  5. Integrate with existing dashboards like Shipshape or SonarQube. These tools have good plugin frameworks and already integrate with many build and CI servers.
  6. Facilitate conversations and early delivery of results. Remember that requirements engineering is the start of a conversation that gets us to a valuable solution. It is never an end in itself. In very few domains does assuming requirements won’t change get you anywhere.

  1. And Basili and Weiss’s 1981 study of change requests and requirements on the A-7 program suggests a power-law distribution for the most costly (e.g., > 1 person-month of effort) changes.  ↩


How Writing Code is Like Making Steel

I saw an interesting keynote from Mark Harman recently, on search-based software improvement. Mark’s lab at UCL also pioneered this idea of automatic code transplants using optimization techniques.

I think that if you are an engineer doing fairly standard software development, you should be concerned. The ultimate vision is to be able to take some specification with thorough tests, written in a language at a high level of abstraction (e.g., here is my corporate color palette, here are my security requirements), and automatically generate the application.

There are several forces at play here. One is the increasing componentization of large and complex pieces of software. We’ve always had software reuse, but it tended to be at a much smaller scale: the ODBC API, or the OAuth framework. Now our frameworks cover much larger areas of concern, particularly when we look at container technology running on commodity hardware. In those cases, someone else is maintaining huge chunks of your software base: the OS, the backend, the messaging system, etc. If you then take your Rails app and add it to that stack, how much, as a percentage, have you created? A decreasing amount, in any case.

The other force is the improvement in genetic and other optimization algorithms, combined with the inevitable scaling of computing power. That means that even though you may be really good at crafting code, and the machine generates garbage, it can improve that garbage very, very quickly.

How different is it for me to copy and paste the sample code on the Ruby on Rails site to create a new application than for a computer algorithm to follow those same steps? To be clear, there remain a lot of complex decisions to make, and I’m not suggesting algorithms can make them: things like distributed systems engineering, cache design, and really just the act of taking a user requirement and turning it into a test.

So how is this like the steel industry? I think it reflects commodification and then automation. Steel was largely hand-made for years, but the pressure of capitalism generated rapid improvements in reducing costs, largely labor costs. Processes and parts became standardized, so it was possible to set up mills at much lower cost. The difference in quality between US and (say) Indian steel became small enough not to matter. But even in India the pressures continue downward, so India’s dramatically lower labor costs still cannot compete with automation.

Some of these pressures don’t exist in software, of course: there is still a large knowledge component to it, and there are no health and safety costs in software labor (the hazards of RSI and sitting notwithstanding). So I don’t see any big changes immediately, but the software industry is probably where the steel industry was in the 20s. In 50 years I cannot see software being written by hand at the level it is now, with the exception (like in steel) of low-quantity, high-tolerance products like embedded development. The rest will be generated automatically by algorithms based on well specified requirements and test cases. Silicon Valley will become the rust belt of technology. You realize that Pittsburgh, birthplace of the steel industry, was once the most expensive city in the US, right?

If you doubt this, I think we are really arguing over when, not whether. My simplest example is coding interviews. Why test people on knowledge of algorithms that are well understood, to the point where they are in textbooks and well-used library code? The computer can write the FizzBuzz program faster and more efficiently than a human can. Over the next few decades, I believe Mark Harman’s optimization approach will encompass more and more of what we now do by hand.
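For reference, the canonical FizzBuzz (one of many equivalent versions) is about as trivial as interview programs get, which is rather the point:

    public class FizzBuzz {
        public static void main(String[] args) {
            for (int i = 1; i <= 100; i++) {
                if (i % 15 == 0) {
                    System.out.println("FizzBuzz");
                } else if (i % 3 == 0) {
                    System.out.println("Fizz");
                } else if (i % 5 == 0) {
                    System.out.println("Buzz");
                } else {
                    System.out.println(i);
                }
            }
        }
    }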


Garbage In, Garbage Out

My dad had this great cup from one of his visits to COMDEX (ostensibly to keep up with the latest in the tech world, which at the time COMDEX represented). It said “Garbage in, garbage out” (GIGO), and then had the name of some failed software company.

[Image: GIGO mug (Cafepress)]

I read a great blog post about intermediate targets, over-optimizing what you measure (Goodhart’s law), and the unintended side effects. Then I watched a presentation on the future of data visualization.

The commonality, to me, is an undesirable focus on the simple over the complex. A dashboard can at a glance tell you how fast your car is going, which is useful because it maps to two concerns you have as a driver: obeying the speed limit, and minimizing your time in the car. I should say “maps directly”, because as an indicator for these two concerns, speed is pretty much a 1:1 mapping. But consider a car indicator with a much poorer mapping to your concern: the “distance remaining” gauge new cars have. This tells you that, based on some model of past driving behavior, you can expect to travel X more miles before the fuel runs out. The problem is that this indicator is no longer a simple mapping. You have a (possibly non-linear) model of past behavior (with no idea how far back the model goes); possibly inaccurate sensors (e.g., depending on temperature, the amount of fuel that appears to remain might change); and finally, it is predicting future behavior (that you will continue to drive to work tomorrow, and not go on a long-distance highway drive).
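To make the hidden model concrete, here is a deliberately naive sketch (the window size and the averaging scheme are arbitrary choices of mine, which is exactly the point): the number on the gauge is the output of assumptions, not a measurement.

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class RangeEstimator {
        // Arbitrary modelling choice: average consumption over the last N samples only.
        private static final int WINDOW = 50;
        private final Deque<Double> litresPer100Km = new ArrayDeque<>();

        void recordSample(double consumption) {
            if (litresPer100Km.size() == WINDOW) {
                litresPer100Km.removeFirst(); // older driving history silently drops out of the model
            }
            litresPer100Km.addLast(consumption);
        }

        // "Distance remaining" = fuel left divided by assumed future consumption.
        double estimateRangeKm(double fuelLitres) {
            double avg = litresPer100Km.stream()
                                       .mapToDouble(Double::doubleValue)
                                       .average()
                                       .orElse(8.0); // made-up default when there is no history
            return fuelLitres / avg * 100.0;
        }
    }

Two cars with the same fuel level can show very different ranges simply because the window, the default, or the sensor calibration differ.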

In much the same way, I think this fascination with metrics and dashboards confuses the construct with the concern. If I’m the government CIO, my concern is the value for taxpayer money each project is generating. But the dashboards are probably showing me constructs like estimated time to completion or lines of source code. Furthermore, and this is the data/info vis piece, those constructs are being mapped onto visual variables using some arbitrary function. For instance, the decision to turn something from green to red might be based on a simple threshold chosen by an intern.

In broad strokes, constructs like source lines of code can, I think, be useful: logarithmically, perhaps, in the sense that a system with 100 thousand lines is more complex than one with only 10 thousand.

This typically isn’t how dashboards work, though. Our thinking about numbers seems so innately arithmetic (to us, 5 is halfway between 1 and 9, not 3, the logarithmic midpoint) that we cannot comprehend how little the dashboard is telling us. The Japanese lean movement has a nice term that captures what I think needs to happen: genchi genbutsu, roughly “go and see for yourself”, or management by walking around. In a factory, just looking at metrics for production speed and inventory is not the whole picture, and the creators of the Toyota Production System learned long ago that you had to actually walk the shop floor and see with your own eyes.

This is perhaps harder in the non-physical world of software, but I think most of us have an innate sense of project performance: are meetings productive? When was the last time you saw a working piece of code? Do you get quick answers to emails? While it is possible to metricize these things, it probably won’t help much more than buttonholing someone in the hallway.


Running a “Critical Research Review” at #RE15

Today we conducted our first attempt at “Critical Research Reviews” (CRR) at our workshop on empirical requirements engineering (EmpiRE) at the 2015 Requirements Engineering Conference.

CRR was introduced to me by Mark Guzdial’s post on the same exercise at ICER last year, where it was run by Colleen Lewis. The idea (as I understand it) is to have researchers present work in progress, ideally at the research design stage. The purpose is to “leverage smart people for an hour” to improve and stress-test your research idea and methodology.

The cool part about doing this at EmpiRE is that our proposers got to leverage some of the leading empirical researchers in the RE community. These are the people likely reviewing your full paper, so it makes sense to get their critique up front.

We had three accepted “research proposal” papers as a special category in the workshop call. In the afternoon (2pm–5.30pm) we had the three presenters each give a 15-minute plenary presentation to make everyone in the workshop (25 or so people) aware of the work. I restricted questions, so this was over in about 45 minutes. After a coffee break, I introduced the CRR concept and some ground rules, as well as a list of potential questions to consider. Then, for the next 45 minutes or so, the participants were invited to join the presenter whose work interested them and have a (polite) discussion about the proposed research.

Finally, I had asked each group to bring some wide-ranging thoughts back to the entire workshop for the last 30 minutes. My intent here was not to go into specifics on the proposals, but rather to surface lessons that might be useful for the people who were not part of that particular group. This worked pretty well; it did tend to go into more detail than perhaps warranted, but it stimulated some interesting discussion.

From what I heard, people quite enjoyed this approach to research evaluation. It’s much more fun trying to poke holes in a research approach when the author on the other end can rebut your arguments! Look for another edition next year.

You can find my slides introducing the idea here, and our proceedings, with the presenters’ research proposals, will be posted whenever IEEE gets around to it.

Lessons learned

  • The room was terrible: a large central conference table. One group retreated to the coffee room, which had large circular tables.
  • No one used the flip charts; I think the presenters were writing their own notes on their laptops anyway.
  • We mostly had established researchers presenting. In the future we are considering restricting this to early-career researchers or PhD students, who likely need the assistance more. But I think the more senior researchers still benefited; the primary difference, I think, is that they will have already considered more of the potential threats.
  • I was the main facilitator, and having one group a 2-minute walk away made this harder. No group really needed help, but I can certainly see situations where that would be an issue: too many people going to one presenter, one person dominating the discussion, or too much negativity (the usual group dynamics, in other words).


A Field Study of Technical Debt

Over on my employer’s blog, I’ve written up our survey results on technical debt.
