Columbus’s Heilmeyer Catechism

I have no idea if Columbus had to have his "India Expedition" proposal peer-reviewed, but here is my interpretation of it according to the ever-popular Heilmeyer catechism.

What are you trying to do

I would like to sail to India and bring back gold and spices for the Crown of Spain.

How is it done today

Currently no one has sailed west. Everyone takes the trip east, around the Cape of Good Hope. Most of these people think the world is flat and that heading west would cause us to fall into space.

What’s new in your approach

I will head west. I’m pretty sure the Earth is round, and we can reach India from the west in less time

Who cares?

A faster trading route to India, monopolized by our mapping skills, would generate 1 million Real a month for the royal treasury.


There is a lot unknown about the middle of the Atlantic, including rumors from the Vikings that some colder land is in between. My math may be off in calculating the circumference of the Earth. I am not a great sailor. We may encounter fierce alien tribes.

Cost and schedule

For 1000 Real we can outfit four boats with sailors, supplies, and weapons (note: of course Columbus would never get all he requested, either!). We plan on a quick 1 year voyage to India, and one more year back.

Checkpoints for success

We plan to see India after 2000 nautical miles of sailing. While measuring distance at sea is currently impossible, after 3 months we expect to sight land. If not, we will head back.

On SCAM’s new “Engineering Track”

This year SCAM, the Working Conference on Source Code Analysis and Manipulation (located in Raleigh, NC, Oct 2–3 2016) includes an engineering track, as described here. The CFP is available here. This track will be co-chaired by myself and Jurgen Vinju. In this post I want to briefly explain what an engineering track is and why you should submit to it![1]


Software is an engineering discipline, for most definitions of ‘engineering’. My definition, for what it’s worth, includes the notion that it involves working on real systems that do things, and to that end research in software engineering can be seen as a design science, where the chief task is to “design and investigate artifacts in context”.[2] This implies that for the most part researchers in this space need to concern themselves with pragmatics: how will this work at scale? How do people do this now? What data can we use that has practical relevance?

However, traditional conference submissions (the dominant form of scholarly dissemination in Computer Science) tend to follow the 10page, aim/motivation/observations/conclusions framework, often full of Greek letters and references to obscure papers. Whether this is a good way to advance the engineering discipline is debatable, but in any event, such a submission tends to ignore two things: one, how people dealing with problems in practice can use the work; two, the artefacts related to the scientific endeavor (the ‘treatment’ in Wieringa’s design science parlance). While improving, too many research papers still do not include tool downloads, fail to show practical impact, or fail to provide for data download to replicate the findings.

Our engineering track is out to improve the practical, engineering-relevant side of source code analysis and manipulation.

Submission types

This track has evolved from the tool track of previous SCAMs. As David mentions,

This is not to discourage tool paper submissions–they will now fall into the Engineering Track–but to broaden the scope of the tools track … for those of you that invest blood, sweat, and tears into tooling, infrastructure, or realistic field studies SCAM recognizes the value of this work, which is not always pure research, and we are designing this track to attract that type of work.

What artefacts qualify as “engineering track” material (from CFP)?

  • tools: software (or hardware!) programs that facilitate SCAMmy activities.
  • libraries: reusable API-enabled frameworks for the above.
  • infrastructure: while libraries are purely software, infrastructure can include projects that provide/facilitate access to data and analysis.
  • data: reusable datasets for other researchers to replicated and innovate with.
  • real world studies enabled by these advances. Here the focus is on how the {tool,infrastructure, etc} enabled the study, and not so much the study itself. Novelty of the research question is less important than the engineering challenges faced in the study.

Some of the criteria the PC will look at includes:

  • How well motivated are the use cases (and hence the existence) of the engineering work. Here we are asking whether this solves some realistic and ongoing challenge in practice. However, we are open to brilliant new ideas that scratch a previously unknown itch[3].
  • Relate the engineering project to earlier work. All engineering is a product of lessons learned, so including some narrative about how this particular submission has evolved is useful (e.g., what paths turned out to be dead ends).

Optionally (and encouraged):

  • Any empirical results or user feedback is welcome.
  • Contain the URL of a website where the tool/library/data etcetera can be downloaded, together with example data and installation guidelines, preferably but not necessarily open source
  • Contain the URL to a video demonstrating the usage of the contribution.

Ideally one would submit and make public the artifacts and required steps to create it. However, realistically people may not be able to (given IP rules, NDAs, etc.).

Program Committee

Building on SCAM general chair David Shepherd’s excellent blog post on industry tracks, both Jurgen and I are committed to a program committee (PC) that has strong industry representation. That doesn’t mean only people who work in industry, but at least means people who have some sense of the engineering challenges of building real-world software. The purpose is to vet submissions against the standards industry holds: not necessarily will work right away at scale in mission critical systems, but that there is some promise of that.

Incidentally, if you are a former academic now practicing, or just a research-minded practitioner, I would love to hear from you for future PCs. We need more folks straddling the two cultures.

“Related Work”

We are not the only place thinking of how to expand and include more non-traditional research papers. At MSR (Working Conference on Mining Software Repositories) there is a data track, a tools track, and a mining challenge.

One of my favorite venues, the International Conference on Requirements Engineering, has long had what I have found to be the strongest industry focus of any software conference. In part I think this is because RE is implicitly concerned with what business needs, but it also reflects a purposeful ambition to increase relevance of the research results. For example, there is a “Ready-Set-Transfer!” panel in which academics present tools to practitioners to review practical readiness.

Practitioner conferences are (almost by definition) industry focused [4] and both the Agile series of conferences and the XP conference include mirror-world ‘research’ tracks.

  1. Incidentally, I agree with and support the ICSME co-chairs’ statement on the anti-LGBT legislation in North Carolina.  ↩
  2. That definition is from Roel Wieringa’s excellent design science book.  ↩
  3. can itches be unknown? I may be mixing metaphors.  ↩
  4. Incidentally, I am not a big fan of the term “industry” or “industrial”. Maybe it is my location in Pittsburgh, but it conjures up steel mills and heavy machinery. The other problem is the term “industry” is used as a catch-all for a widely different set of folks, from a 2 person startup to a Fortune 500 company or DOD agency. I prefer research vs practice. Not a huge fan of “real-world” either, since we all live in the real world. Presumably.  ↩

On Using Open Data in Software Engineering

I recently reviewed data showcase papers for the Mining Software Repositories Conference, and I’m co-chair of the Engineering track (subsumes datasets, tools, approaches) for the SCAM conference[1]. I’ve worked with a number of different datasets (both openly available and closed) for my research efforts. This caused me to do some reflection on the nature of empirical data in SE.

We’ve had a nice increase in the amount of data available for researchers to explore, and most recently, the amount of well-constructed, easily understandable and accessible datasets – like the GHTorrent tool – is impressive (traditionally it has been difficult to get any credit for creating these resources). I think it is a hugely beneficial effort for our efforts to create a well-grounded, empirical basis for software engineering (as opposed to pie in the sky theorizing).

I have two concerns that threaten this idyll.

Concern 1: long term availability and replication

Other fields, primarily psychology and economics, have begun painful self-examinations of well-cherished, supposedly well-grounded results. In many cases replication of these findings is very difficult. What worries me most is that in empirical SE, we aren’t even in a position to attempt to replicate key findings, because (among other concerns) the original datasets aren’t accessible. This is most often because studying companies is usually caveated with “but you can’t tell anyone it was us”; other problems include datasets that are ‘live’ and self-modifying (for example, publishing a study on foul language in commit histories might mean those histories are modified), or just out of date entirely.

I propose two solutions. The first is to use long-term archiving approaches. This means no personal university website, no special database formats, no ‘email me for access’. The open access/open research communities have been tackling this problem for several years, and good solutions exist (like identifiers from DataCite or hosting like with Felienne Hermans’s Figshare). We have software specific repositories as well, of which the Promise repo is the one I know best. The 2016 CHASE workshop encouraged authors to stick papers on ArXiv, but the data should be similarly archived. One problem here is how to store data that institutional review boards might insist be destroyed.

The second is how we cite the data when we publish it in a journal or conference paper. I’ve seen URLs in the body of the paper, standard reference lists at the end of the paper, footnotes, and comments in the author’s notes. We need to standardize this so readers can quickly and easily find the URL to the dataset. My suggestion is to add a separate heading, like the useless ACM keywords, referencing any new datasets used/available, immediately after the abstract.

Concern 2: new ethical questions

My second concern is one I only recently realized (somewhat shamefully), and this is the ethics of publishing data. This was prompted by the GHtorrent debate. There the central objection from (some) developers was that their email address was accessible, perhaps more easily accessible than on Github. Email is an obvious one, but beyond that I think we need to acknowledge that the mental model of a developer pushing code to Github (or wherever) is not one of public visibility. The silly comments they write (“huge HACK!”) could be career-limiting, and pushing that dataset out into the world is a question we have to tackle. My view would be to explicitly include terms of use in data you make available, and Georgios is obscuring identities (e.g. with a SHA1 hash) when asked.

More broadly, the world of SE analytics opens up the chance that the data could be used for purposes we might not understand. For instance, one might show that a particular Chromium developer has an unusually high number of bugs. As researchers we might understand the nuances and limitations of how the data was collected[2]. Managers might not understand the limitations, however. Furthermore, any data one collects is subject to subpoena and other legal requests, so in many cases immediate destruction is the best course (see the debate on sociology notes).

The big risk from NOT considering this is losing the goodwill of a community that only tangentially understands what software engineering research is. I’m not advocating for doing whatever developers tell you to[3]; I am saying that not considering these issues risks antagonizing the people we are trying to help.

  1. Submit early, submit often!  ↩
  2. Jorge Aranda’s ‘secret life of bugs’ paper sheds light on this.  ↩
  3. You put your freaking email on Github! What did you expect!? (sigh)  ↩

The Marginal Utility of Testing/Refactoring/Thinking

Andy Zaidman had an interesting presentation about test analytics. The takeaway for me was that a) people overestimate their unit test engineering (estimate: 50%, reality, 25%). But b) the real issue is convincing a developer that this unit test will improve the quality of the code. In other words, like with technical debt, or refactoring, or commenting, the marginal utility of adding a test is perceived to be low (and of course the cost is seen as high). Each new individual test adds nothing to the immediate benefit (with some exceptions if one is following strict TDD). And yet each one requires switching from the mental model of the program to the one of Junit frameworks and test harnesses.

The issue is not whether testing is good or bad, but rather, which testing is most useful. It seems unlikely to me that the value of individual tests is normally distributed but rather power-law form (i.e., that there are a very few extremely high value tests). And this isn’t just about testing; indeed, most activities with delayed payoff—refactoring, documenting, architecting—likely exhibit the same problem. It is hard to convince people to invest in such activities without giving them concrete proof it is valuable. You just have to look at the default examples for Cucumber, for instance, to see that the vast majority are trivial and easily grasped without any of the tests. Similarly, “code smells are bad”, but bad might just mean they look nasty, while having little to do with the underlying effectiveness of the code. It isn’t technical debt if it never causes a problem. It isn’t a bug if it isn’t worth fixing it.

In new work we are starting with Tim Menzies, we are trying to understand the inflection point beyond which your decisions add little incremental value (i.e., stop adding more tests). The good news is this is easy to spot in hindsight; the challenge is to take those lessons and determine this before doing hours of pointless work. The direction we are taking is to try and capture the common patterns the key decisions share (in the testing example, perhaps this is bounds testing). Ultimately, we hope to provide advice to developers as to when the marginal utility falls below a threshold (i.e., stop testing!)

The other point is the over-reliance of software engineering on hoary folklore. Things like “some developers are 10x as productive”, or “80% of bugs occur in requirements”, tend to be statements that are derived from a single study, conducted in 1985, on 3 large scale defense projects, but have somehow made their way down the years to become canon. Ours is not the only field to suffer from this, of course. But when capable developers refuse to pay 200$ a year to join the IEEE Digital Library, it seems to demonstrate a firm commitment to ignorance.

A Model of Software Quality Checks

Software quality can be automatically checked by tools like SonarQube, CAST, FindBugs, Coverity, etc. But often these tools encompass several different classes of checks on quality. I propose the following hierarchy to organize these rules.

Level 0: Syntax quality

Focus: code that ‘runs’.

Level 0 means a compiler or interpreter’s components (parsers, lexers, intermediate forms) assess syntax correctness. Level 0 because (clearly) without proper syntax nothing is getting done.

Level 1: Lint-free

Focus: Code that respects obvious sources of problem.

No warnings occur if all possible flags are turned on in the compiler. These warnings tend to be close to syntax in their complexity. For example, technically a fall through switch statement is possible in Java, but there is the -Xlint:fallthrough tag to catch this. Often IDEs such as Eclipse will flag these automatically with warning icons.

Level 2: Good code

Focus: Code conforms to commonly accepted best practices for that language.

E.g., for Java, visibility modifiers are suitable, in C, no buffer overflows, memory is released appropriately. Some cross-language practices apply: documentation, unit tests exist, and so on. Many of the quality analysis tools like FIndBugs operate at this level. CWEs are another example. I also place dependency analysis approaches here (perhaps controversially). It also pops up in the next level (e.g., properly using interfaces in Java).

Level 3: Paradigmatic

Focus: writing code that is maintainable, understandable, and performant with respect to its runtime environment.

Would someone writing object-oriented, functional, embedded, etc. code consider this reasonable? Includes principles like SOLID, functional side effects, memory management, distributed code demonstrates awareness of fundamentals of distributed computing. Also includes proper use of language idioms e.g. proper use of Javascript callbacks, Ruby blocks, etc. We might also classify new language features here — the use of generics in Java 7 comes to mind. Essentially, if you did a peer review with a language guru (Odersky for Scala, say), would they have a ‘better way’ to do it? (Perl notwithstanding…)

Level 4: Well-designed

Focus: building systems that respect appropriate (known at the time) usage scenarios.

Given the knowledge available, the code is architecturally appropriate for the quality attribute requirement (QAR) applicable. E.g., modular, performant, secure. The key here is understanding the relevant QARs. Examples include reflexion models (like ArchJava), conformance checking (e.g. Dicto), library analysis (e.g., for license issues, for currency).


A few things become clearer when we view software quality with this approach.

First, I think that quality checks become more useful as you move ‘up’ (0→4) in the hierarchy. That is, I’d rather know that I have a serious design problem than a code quality problem.

Second, unfortunately, it seems much harder to design truly automated checks at the higher levels. This is why we have a lot of manual architecture analysis but leave code quality to tools.

Third, our rules get more context-specific as we move up the hierarchy. I.e., in order to properly check paradigmaticness1, I need to know your choice of programming language and possibly your problem domain properties. To properly do design validation, I need to know what qualities are important to you: performance? availability? That, I think, is partly what makes these levels more useful.

Other hierarchies

The one I’m most familiar with is from Jean-Louis Letouzey. He proposed the SQALE quality model, and his central insight is that some qualities precede others: you must have maintainable code before having performant code, or testable code before secure code.

EDIT [1/6/16]: somehow I forgot this CAST diagram showing different levels of analysis, very similar to mine. They also claim that the ‘system level’ (my design level) is the place where architecture is checked.

  1. I’m not sure how to ‘noun’ this adjective … 

Requirements, Agile, and Finding Errors

It’s a long held view in the requirements engineering (RE) community that “if only we could do RE better, software development would be cheaper”. Here ‘doing RE better’ means that your requirements document adheres to some quality standard such as IEEE 830. For example, none of the requirements are ambiguous.

One justification is that, based on (very few) studies in the late 80s, requirements errors cost a lot more to fix in test/production than when they are incurred. For instance, if I tell a subcontractor she has a 100 kilobyte message size limit, and I really meant 100 kilobits, fixing that problem after she has delivered the subcomponent will be expensive. This seems obvious. But two problems emerge. 1) Why does she have to wait so long to integrate the subcomponent? 2) how many of these problems are there? Granted it is cheaper to fix that particular error in the requirements/system engineering phase, how much money should we spend to find these errors at that point? [1]

An interesting early experiment on this is described in Davis, 1989, “Identification of errors in software requirements through use of automated requirements tools”, Information and Software Technology 31(9) p472–476. In an example of an experiment we see very rarely these days, his team were given sufficient funds to have three automated requirements quality tools applied to a large software requirements specification for the US Army (200,000 pages!). The tools were able to find several hundred errors in the spec, including errors of inconsistency. Yay, the tools worked! But….

The program had decided to go ahead and build their (Cobol) system before the automated analysis. The developers on the program didn’t care much about the findings. 80 of the 220 modules were not detectable in the final system (meaning, presumably, they were either merged or omitted altogether). Davis did some post-delivery follow-up, showing that the modules with greater numbers of requirements problems had a significantly greater number of post-release defects. But whether the two are causally related is hard to say (those modules may simply be more complex in general, so both requirements and code are harder to get right).

What I conclude from this is that finding errors of the sort they did, e.g.,

PROBLEM: the referenced table directs that PART_NO be moved from the WORK_ORDER_FILE to the WORK_TASK_FILE. Available fields in the WORK_TASK_FILE include PART_NO_FiELD_PART and PART_NO_FIELD_TASK.

CHOICE: We assume that PART NO FIELD_TASK is the proper destination.

are ultimately of zero value to document. As a result, finding problems with them, automated or otherwise, is also of no value. Of course we know all this from the past 20 years of the agile movement, but it is interesting to see it in action. I think that (in 1989 certainly) this was excusable, as the program managers had no good sense of what made software special. The level of detail the design describes, down to field names and dependencies, is better suited to the Apollo program, where they prescribe how tightly to turn bolts, label each individual bolt, etc. Which makes sense in a safety critical dynamic environment, but not a lot of sense in an office logistics tool.

Going Forward

A term I loathe but seems better than “Future Work”. I’ve worked a lot on automated requirements tools like PSL/PSA or SREM, so where should we head with automated tooling for requirements?

There is a lot of empirical evidence that simple, easily integrated process patterns such as requirements goals and scenarios lead to higher quality requirements. Intel, for example, are strong believers in training staff in writing good requirements (although notice their domain is also hardware-oriented and mistakes are costly). Even in agile settings I believe there are big improvements to be gained in writing better user stories (e.g., how to create the “Magic Backlog” described in Rebecca Wirfs-Brock’s EuroPLoP 2015 paper).

Furthermore, we are seeing more and more use of machine learning to flag requirements problems. For example, at Daimler they have simple detectors for checking requirements. And at Rolls-Royce, based on simple training exercises, they label requirements based on potential risk, combining uncertainty, change impact and cost into an index. All of these types of tools integrate will into a developer analytics approach, able to populate dashboards and flag things unobtrusively (compared with the cost of writing requirements formally).

Like with any analytics techniques, which ones to apply is situation-specific. Small companies doing the same things in well-understood domains won’t need much, if any requirements analysis. I think there is a lot of room for intelligent augmentation of what makes a good requirement, that facilities conversations and discovery of uncertainty, that automates the repeated and boring tasks (if you cannot possibly avoid creating a 2000 page document …). And in specialized domains, we are moving to a world where more and more of the analysis can be done in models, to verify timing requirements, guarantee that software partitions hold, and so on. Here the line between ‘requirement’ and ‘design solution’ is blurry, because requirements at one level become design solutions at the next level. A mature requirements practice would leverage this to enable experimentation and prototyping in silico, as it were, finding design problems before releasing products or fabricating chips.

Finding Defect Leakage

A major goal for large programs is to reduce defect leakage, the number of bugs that make it to production (to put it more precisely, reduce the number of critical bugs that make it to production). It seems to me there are at least four complementary approaches to this issue:

  • We could do this manually, and insist on writing good requirements using checklists, training, inspection, etc.
  • We could use formal methods, on well-formed architectural models, looking for very specific rule violations (safety, security, performance);
  • We could apply machine learning tools on past artifacts and try to leverage experience to predict problems. Not every requirement is equally important (obvious but not always followed).
  • We could design a process that accepted the inevitability of change and made it not only possible, but desirable to change design and requirements in response to new knowledge.

For the automated tools, I have this quick list of principles. Much like software analytics in general:

  1. Don’t make life worse. Developers should not dread having to do this. That said, an ounce of pain is worth a pound of pleasure.
  2. Work with existing tools like Doors, Jira and Excel. Your Eclipse plugin does not count.
  3. Don’t mandate new or complex languages or tools for requirements. We can barely get engineers to write requirements in natural language as it is.
  4. Prefer lightweight, high value checks over complex, theoretically appealing ones. Socialize people to the value of checking anything before insisting on the complex stuff.
  5. Integrate with existing dashboards like Shipshape or SonarQube. These tools have good plugin frameworks and already integrate with many build and CI servers.
  6. Facilitate conversations and early delivery of results. Remember that requirements engineering is the start of a conversation that gets us to a valuable solution. It is never an end in itself. In very few domains does assuming requirements won’t change get you anywhere.

  1. And Basili and Weiss’s 1981 study on the A7 program’s change requests and requirements suggest a power-law distribution to the most costly (e.g., > 1 person-month of effort) changes.  ↩

How Writing Code is Like Making Steel

I saw an interesting keynote from Mark Harman recently, on search-based software improvement. Mark’s lab at UCL also pioneered this idea of automatic code transplants using optimization techniques.

I think if you are an engineer who does fairly standard software development you should be concerned. The ultimate vision is to be able to take some specification with thorough tests, written in a language at a high-level of abstraction (e.g., here is my corporate color palette, here are my security requirements) and automatically generate the application.

There are several forces at play here. One is the increasing componentization of large and complex pieces of software. We’ve always had software reuse, but it tended to be at a much smaller level – the ODBC api, or the OAuth framework. Now our frameworks reach much larger areas of concern, particularly when we look at container technology running on commodity hardware. Someone else is maintaining huge chunks of your software base, in those cases: the OS, the backend, the messaging system, etc. If you then take your Rails app and add it to that stack, how much, as a %, have you created? A decreasing amount, in any case.

The other force is the improvements in genetic and other optimization algorithms, combined with the inevitable scaling of computing power. That means that even though you may be really good at crafting code, and the machine generates garbage, it can improve that garbage very very quickly.

How different is it for me to copy and paste the sample code on the Ruby on Rails site to create a new application, than for a computer algorithm to follow those same steps? To be clear, there remain a lot of complex decisions to make, and I’m not suggesting algorithms can do so: things like distributed systems engineering, cache design, and really just the act of taking a user requirement and turning it into a test.

So how is this like the steel industry? I think it reflects commodification and then automation. Steel was largely hand-made for years, but the pressure of capitalism generated rapid improvements in reducing costs – largely labor costs. Process and parts became standardized, so it was possible to set up mills at much lower cost. The difference in quality between US and (say) Indian steel became small enough to not matter. But even in India, the pressures continue downward, so India’s dramatically lower labor costs still cannot compete with automation.

Some of these pressures don’t exist in software, of course: there is still a large knowledge component to it, and there are no health and safety costs in software labor (the hazards of RSI and sitting notwithstanding). So I don’t see any big changes immediately, but the software industry is probably where the steel industry was in the 20s. In 50 years I cannot see software being written by hand at the level it is now, with the exception (like in steel) of low-quantity, high-tolerance products like embedded development. The rest will be generated automatically by algorithms based on well specified requirements and test cases. Silicon Valley will become the rust belt of technology. You realize that Pittsburgh, birthplace of the steel industry, was once the most expensive city in the US, right?

If you doubt this, I think we are really arguing over when, and not what. My simplest example is coding interviews. Why test people on knowledge of algorithms that are well understood, to the point where they are in textbooks and well-used library code? The computer can write the FizzBuzz program faster and more efficiently than a human can. Over the next few decades, I believe Mark Harman’s optimization approach will encompass more and more of what we now do by hand.