On Bots in Software Engineering


Neil Ernst


April 15, 2021

Bots as assistants

The state of bots

There have been a number of categorizations of bots for software development. The main categories seem to be the ones Erlenhov came up with, which treat bots as either API endpoints (CI tools), developer assistants, or something more sophisticated. I don’t think there is much to say about CI tooling, though this is a spectrum in my view: we will likely see CI tasks grow more sophisticated and multistep.

Judging by the paper Peggy and Alexei wrote in 2015, not much has changed since then. In one sense we are still doing the same thing: bots as API interaction points that, automatically or under prompting, carry out some well defined and usually pre-existing task, like a compiler and its flags.

This is in contrast to the bots that appear on corporate sites to maximize engagement and increase sales. There the bot acts as a query refinement tool, helping you to sort out what you actually are looking for. This type of extensive interaction seems to contradict what developers want.

So one question is not what bots should do, but what tasks we are looking for help with. Bots are useful because they encapsulate common tasks that humans would otherwise have to do. For example, we once took punch cards to an operator to enter into the computer; now that is done by hitting compile/run. I had a hierarchy of problems we can get help on:

  1. Syntax problems - compilers, integrated into most IDEs, now flag obvious problems before you need to run the code. IntelliJ, for example, can automatically detect type problems.
  2. Warnings and flags - running a compiler with default options just gets the bare minimum. Standards like MISRA-C specify known problems the compiler can find for you.
  3. Linters - plenty of problems with code are known, so we can find things that clearly violate best practices, such as loose equality checks (`==`) in JavaScript. Often these are integrated into tools like SonarQube or CI environments.
  4. Code smells - the next class of code issues are called smells and have to do with slightly more complex problems, often spanning multiple modules: long methods, long parameter lists, and so on. This is also where we might find violations of language paradigms, such as not using list comprehensions in Python.
  5. Design problems - the highest level of problems we might ask for help with are design flaws. Here we want to flag issues that we think will impinge on quality attribute satisfaction. We might identify that the code misuses the Observer pattern.
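To make the middle of this hierarchy concrete, here is a minimal sketch of a level-4 smell check: flagging long parameter lists using Python’s `ast` module. The threshold of four parameters is an assumption for illustration, not a value from any standard.

```python
import ast

MAX_PARAMS = 4  # hypothetical threshold, chosen for illustration


def long_parameter_lists(source: str) -> list[str]:
    """Return names of functions with more than MAX_PARAMS parameters."""
    tree = ast.parse(source)
    offenders = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # count positional and keyword-only parameters
            count = len(node.args.args) + len(node.args.kwonlyargs)
            if count > MAX_PARAMS:
                offenders.append(node.name)
    return offenders


code = """
def ok(a, b):
    pass

def smelly(a, b, c, d, e, f):
    pass
"""
print(long_parameter_lists(code))  # ['smelly']
```

Note that the higher levels of the hierarchy resist this treatment: a design problem like a misused Observer pattern has no comparably simple syntactic signature.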

There are some others that are a bit tangential to these: flagging security problems; identifying dependency or build issues; UI problems; test coverage issues.

The fundamental issue in my view for AI assistance is that it is super easy to tell people what they already know. So in self-driving the issue is not to drive from A to B in the sun; I could probably train my son to do that. What we need is meaningful assistance: telling us the things we didn’t know, or couldn’t know, on our own. Bug localization that finds bugs we already know about, code smell detectors for smells we don’t care about, and most other published data studies on effort estimation and so on all fail this test.

The challenge is that getting tools to the 70% accuracy level is super easy with today’s tools. But humans are insanely smart, so that first 70% is not where we need help. It is the last 30% where we need the help, and that is also the hardest part to deliver: figuring out what the image is really showing in the dark.

I think bots are similar. Right now they are basically just dumb endpoints to an API with a slightly improved interface. Thus Dependabot telling us that a library is outdated is not particularly interesting, since it is just an interface to a more complex script running in the background. What’s novel in Dependabot is that it interacts in a well understood way, not the complexity of what it is trying to do. Similarly, the bots one sees on airline sites are not interesting assistants, but only simplistic interaction techniques in a web-search world. After all, John Mylopoulos was working on natural language interfaces to databases 40 years ago.
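The “dumb endpoint” view can be sketched in a few lines: the hard part (knowing the latest versions) is assumed as input, and the bot merely formats the result as a human-style message. All package names and versions here are invented for illustration; the real Dependabot of course queries package registries itself.

```python
def outdated(pinned: dict[str, str], latest: dict[str, str]) -> list[str]:
    """Format Dependabot-style bump messages for outdated pins."""
    messages = []
    for pkg, version in pinned.items():
        newest = latest.get(pkg)
        if newest and newest != version:
            messages.append(f"Bump {pkg} from {version} to {newest}")
    return messages


# Invented example data: the "complex script in the background" would
# normally produce the `latest` mapping.
pinned = {"requests": "2.25.0", "flask": "2.0.1"}
latest = {"requests": "2.31.0", "flask": "2.0.1"}
print(outdated(pinned, latest))  # ['Bump requests from 2.25.0 to 2.31.0']
```

The interesting part of the real bot is everything outside this function: opening a pull request, responding to comments, interacting in the ways developers already understand.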

So what does this more interesting bot look like then? Is it more than just an API endpoint, or something else? In my view the next step is truly an assistant that is contextualized to the person asking the question, to the project in which it runs, and to the particular time it is being run. It is, in short, a very efficient, Ms Moneypenny-like administrative assistant, capable of unobtrusively anticipating the needs of the developer.

How do we get there? There are a number of pieces to this vision.

  • Interface: how do developers like to get information? Right now that is things like IDEs, compiler warnings, and interactions on pull requests: working with the artifacts they already care about.
  • Persona: how should an assistant interrupt? Should it be factual and say “we detected a problem on line 50”? Or should we omit Gilfoyle’s annoying verbal tics?
  • Context: how does the assistant extract the context information it will need to be useful and not annoying?
  • Metrics: what would be a relevant way of assessing success? I don’t think we even have a good idea on this.


Tasks bots can help with

  1. Detailed tasks: Mylyn-style tasks that have to do with a specific problem, like finding a bug or refactoring a method.
  2. High level tasks: get an overview of the system in order to see how it is progressing. This bot might send a weekly update on lines of code added. Sort of exists as GitHub’s various visualizations.
  3. Design tasks: help me understand how the software will respond to quality attribute requirements.
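The weekly-update style of high level task could be sketched as a small report generator. In practice this would shell out to something like `git log --numstat --since="1 week ago"`; to keep the sketch self-contained, a canned sample of numstat output stands in for the real command.

```python
# Sample of `git log --numstat`-style output: added, removed, path.
# Binary files report "-" instead of a count.
SAMPLE_NUMSTAT = """\
10\t2\tsrc/app.py
5\t0\tsrc/util.py
-\t-\tassets/logo.png
"""


def lines_added(numstat: str) -> int:
    """Sum the 'lines added' column of numstat output."""
    total = 0
    for line in numstat.splitlines():
        added, _removed, _path = line.split("\t")
        if added.isdigit():  # skip binary files
            total += int(added)
    return total


print(f"Weekly update: {lines_added(SAMPLE_NUMSTAT)} lines added")
```

Lines of code added is, of course, a crude progress metric; the point is only that this class of bot is a scheduled query plus a formatter.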

Are bots just “API endpoints”?

Bots are API calls plus “vocal tics”, like the fridge in Silicon Valley:

“It’s bad enough it has to talk. Does it need fake vocal tics like ‘ah’?”

“The tics make it seem more human,” Dinesh tells Gilfoyle.

“Humans are shit,” Gilfoyle replies. “This is addressing problems that don’t exist. It’s solutionism at its worst. We are dumbing down machines that are inherently superior.”

The challenge in these systems has always been that entry level knowledge is extremely easy to retrieve, but going deeper is much harder (kind of like self-driving). For perhaps 80% of the interactions online, the bot can manage. But it is in the details that bots get stuck and need to call for the operator to step in.

We saw something like this with expert systems. Coding something that can advise people to call their doctor when they report a fever of 102°F or higher is pretty simple. But the complex explanations of what is causing the fever are fairly intractable (explanation is usually thought of as NP-hard, after all). Getting the knowledge in to solve the problem (basically, all the heuristics and learning an experienced GP would have) is very expensive: the knowledge acquisition (KA) bottleneck. This is probably less costly now with deep learning. But the other bottleneck is the reasoning. Even if we have that knowledge, inference to multiple competing explanations is very expensive. Recommending the most common explanation, such as a viral ear infection in a toddler, is what bots basically do now; but in many cases there isn’t a clear common explanation, or there is no clear set of symptoms to diagnose.
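The fever example can be written as a two-line rule, which is exactly the cheap “80%” an expert system handles; the escalation branch stands in for the intractable part, where the bot calls for the operator. All thresholds and symptom sets here are illustrative, not medical advice.

```python
def triage(fever_f: float, symptoms: set[str]) -> str:
    """Toy expert-system triage: easy rules first, then give up."""
    if fever_f >= 102:
        return "call your doctor"  # the simple, codable rule
    if symptoms <= {"runny nose", "cough"}:
        return "likely a common cold; rest and fluids"
    # Everything else needs explanation, the expensive part.
    return "escalate to a human operator"


print(triage(103.1, set()))                    # call your doctor
print(triage(99.5, {"runny nose"}))            # likely a common cold; rest and fluids
print(triage(100.2, {"rash", "joint pain"}))   # escalate to a human operator
```

The third branch is where both bottlenecks live: acquiring the GP’s heuristics, and reasoning over competing explanations once you have them.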

Bots for TD reduction

One area where we see a lot of activity is static code analysis to find rule violations. There are more rules than programmers could reasonably want to use: code quality checks, syntax warnings, code smells, and so on. The problem, in fact, is that these warnings annoy developers. At Google they had a scheme where developers could vote on warnings, and checks with more than 10% false positives would be rejected. These tools generate many times more warnings than developers actually act on.
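The voting scheme described above amounts to a simple per-check computation, sketched below with invented vote data (the actual mechanism at Google may differ in its details).

```python
FP_THRESHOLD = 0.10  # reject checks with > 10% false positives


def checks_to_disable(votes: dict[str, tuple[int, int]]) -> list[str]:
    """votes maps check name -> (useful_votes, false_positive_votes)."""
    disabled = []
    for check, (useful, fps) in votes.items():
        total = useful + fps
        if total and fps / total > FP_THRESHOLD:
            disabled.append(check)
    return disabled


# Hypothetical vote tallies for two checks.
votes = {"unused-import": (90, 5), "magic-number": (40, 60)}
print(checks_to_disable(votes))  # ['magic-number']
```

The appeal of the scheme is that the annoyance signal comes from developers themselves rather than from the tool authors’ guesses.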

How could bots help? The main issue I notice is the need for interactivity. Bots could easily process the boring problems in one shot (fix all trailing commas), but more importantly the bot could be an interface to the tool, instead of the common approaches: either a dashboard with too much information, or some simple weekly report. The bot could be a text interface to the tool itself, running predefined queries or making new ones, adapting on the fly as the situation demands. Bots are good for the rapid re-contextualization you do not get with dashboards, which require sophisticated analysis to configure.
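A minimal sketch of this text-interface idea: a dispatcher that maps short textual queries onto lookups over the analysis tool’s warning data. The query grammar and the warning records are hypothetical, invented for illustration.

```python
# Invented warning data, standing in for a static analysis tool's output.
WARNINGS = [
    {"file": "a.py", "kind": "smell", "age_days": 30},
    {"file": "b.py", "kind": "lint", "age_days": 2},
    {"file": "a.py", "kind": "lint", "age_days": 9},
]


def handle(command: str) -> str:
    """Dispatch a chat-style query to a predefined analysis lookup."""
    parts = command.split()
    if parts[:2] == ["warnings", "in"]:
        hits = [w for w in WARNINGS if w["file"] == parts[2]]
        return f"{len(hits)} warnings in {parts[2]}"
    if parts[:2] == ["warnings", "older"]:
        days = int(parts[2])
        hits = [w for w in WARNINGS if w["age_days"] > days]
        return f"{len(hits)} warnings older than {days} days"
    return "unknown query"


print(handle("warnings in a.py"))  # 2 warnings in a.py
print(handle("warnings older 7"))  # 2 warnings older than 7 days
```

Unlike a dashboard, each exchange re-scopes the data to whatever the developer is asking about right now, which is the re-contextualization argument in miniature.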

A bot is also automatable, so it could deliver weekly updates to the developer without the developer having to do anything with that info. Again, though, it seems like we are pushing the complexity (what reports do I really need?) into another interface. The tough problem in TD presentation is to figure out exactly what context underlies the data and show only that.

  • We need to find problems and generate data.
  • We need to filter and store the data.
  • We need to query the data.
  • We need to visualize the data.

Bots don’t make any of these easier per se…
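The four needs above can be lined up as one tiny pipeline over invented warning data, which makes the point concrete: a bot would sit as a front end to stages like these without making any individual stage easier.

```python
def find() -> list[dict]:
    """Generate data (stand-in for running an analysis tool)."""
    return [{"file": "a.py", "sev": 3}, {"file": "b.py", "sev": 1},
            {"file": "a.py", "sev": 2}]


def filter_store(warnings: list[dict]) -> list[dict]:
    """Filter: keep only severity >= 2 (hypothetical policy)."""
    return [w for w in warnings if w["sev"] >= 2]


def query(warnings: list[dict], file: str) -> list[dict]:
    """Query: per-file lookup."""
    return [w for w in warnings if w["file"] == file]


def visualize(warnings: list[dict]) -> str:
    """Visualize: a crude text 'bar chart' of severities."""
    return " ".join(f"{w['file']}:{'#' * w['sev']}" for w in warnings)


stored = filter_store(find())
print(visualize(query(stored, "a.py")))  # a.py:### a.py:##
```

Each stage is where the actual difficulty lives; the bot only changes how we talk to the pipeline.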