Technical Debt and GenAI
Why We Care About Technical Debt
Technical Debt refers to the short-term optimizations – shortcuts – that are taken with potential for longer-term consequences. A typical example is to forego extensive unit testing in order to get a product to market. Once the product is released, in theory the dev team goes back and does it the right way. Often, they do not go back (there’s another fire to put out), and so the TD accumulates, eventually causing problems (interest) like slower release times and impossible-to-maintain codebases.
For a long time this has been a useful characterization of the problem. I’ve written a book on the subject, with Julien Delange and Rick Kazman. In that book we touched briefly on emerging AI systems (this was 2021/22). A lot of the inspiration came from the paper “Machine Learning: The High Interest Credit Card of Technical Debt”. Importantly, though, that paper referred to technical debt in ML systems, e.g., data science models for customer behaviour.
What has emerged since our book is the use of GenAI in software creation, characterized by tools like Claude Code and OpenAI Codex. Thus, in a second edition, I would want to incorporate something that looks at how technical debt is caused by, and can be resolved/paid off by, GenAI tools. A recent workshop on Technical Debt, summarized in the TechDebt manifesto, also touched on this topic.
Causing TechDebt with GenAI
A truism about software development is that code is a depreciating asset (an idea that has existed since the OS/360 work, from Lehman and others). It follows that reinvestment is needed to maintain the asset, and the more of that asset you have, the more you need to reinvest. Furthermore, you really hope someone on the team understands the dark crevices of the asset, the untouched corners that work with some duct tape and baling wire.
Writing Code Is Rarely the Bottleneck
GenAI is really good at creating a lot of code. You can get it to spit out hundreds of lines of working code in seconds or minutes. After all, all the tool is doing is taking your prompt, looking at what other people did in its training data, and regurgitating plausible-looking examples.[1]

We ran a small study last fall with students learning web frameworks (Node, Next.js, Express, etc.). A combination of a tight deadline and a long list of deliverables meant students were forced to vibe-code applications, in languages and frameworks most had never used before. The result was lots of code that no one really understood. When we talked to the students, they were all aware that this had caused tons of technical debt in their applications.
In this sense GenAI is like caffeine (remember Jolt Cola?): “Do Stupid Things, Faster”. I’ve yet to use one that is risk-averse enough to ask whether your request was what you actually wanted, absent a meaningful planning phase (“Do not write code, help me brainstorm the design”). It will happily say “OK boss” and churn out hundreds of lines of code. Most of it actually useful! But some potentially deadly (for safety, maintainability, performance, security).
One thing we have advocated in managing technical debt is to make it explicit. Standups where people agree there is a TD problem, but commit to no action or even explicit identification, are pointless. All you have done is reinforce that there is deferred maintenance and create bad vibes about the product, without any concrete actions to do something about it. Instead, TD should be entered into the backlog, like anything else, and labeled as such. Making it manifest means conversations about paying it back are possible.
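As an invented illustration (the fields and numbers here are mine, not a prescribed format), such a backlog item might look like:

```
[TECH-DEBT] Restore unit test coverage for the checkout module
Label:     tech-debt
Principal: ~3 dev-days to rebuild the skipped tests
Interest:  every checkout change currently requires manual regression testing
Payoff by: before the next payment-provider integration
```

The point is not the format, but that the debt now has a name, a cost, and a trigger for paying it back.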
With GenAI, it is likely your AI has made shortcuts[2] that you will never know about, let alone understand. This is the polar opposite of making TD explicit. While code is the way your product ideas are realized, just having a lot of code is not really the goal. The goal is the minimal amount of code necessary to satisfy the business objectives in the context of quality requirements. It is not clear to me that GenAI can be used to minimize such a function: the reward function is hard even for humans to express, and the amount of local context is extensive.
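One loose way to state that objective, just to show why it is hard, is as a constrained minimization (this formalization is my own, not from the literature):

```latex
\min_{c \in \text{codebases}} \; \mathrm{size}(c)
\quad \text{subject to} \quad
\mathrm{behavior}(c) \models R
\;\text{ and }\;
\mathrm{quality}_i(c) \ge \tau_i \text{ for each quality attribute } i
```

Neither the requirements R nor the thresholds τ_i have crisp, machine-checkable definitions on most projects, which is exactly why neither humans nor GenAI can optimize this directly.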
Fixing TD With GenAI
If GenAI can emit thousands of debt-laden tokens, causing TD, it can also help us fix these problems. A long-standing, unsolved challenge is to retrospectively find sources of TD in a codebase. Attempts to resolve this problem have looked at, inter alia:
- self-admitted TD, places where a developer comments the code with a designation like “FIXME” (a minimal scanner for such markers is sketched after this list).
- dependency information and code rules to quantify TD, using tools such as SonarQube, CodeScene, or DV8.
- metrics such as LOC or the CK (Chidamber-Kemerer) suite to identify complex code.
- refactoring detection and support.
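To make the first of these concrete, here is a minimal sketch of a self-admitted-TD scanner. This is my own illustration, assuming nothing beyond the Python standard library; real tools handle comment syntax and reporting far better:

```python
import re
from pathlib import Path

# Markers developers commonly use to self-admit technical debt.
SATD_MARKERS = re.compile(r"\b(FIXME|TODO|HACK|XXX|KLUDGE)\b")

def scan_for_satd(root, extensions=(".py", ".js", ".ts", ".java")):
    """Walk a source tree and report lines containing SATD markers."""
    hits = []
    for path in Path(root).rglob("*"):
        if path.suffix not in extensions or not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if SATD_MARKERS.search(line):
                hits.append((path, lineno, line.strip()))
    return hits

if __name__ == "__main__":
    for path, lineno, line in scan_for_satd("."):
        print(f"{path}:{lineno}: {line}")
```

By construction, a scanner like this only surfaces debt a developer has already admitted to.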
The main challenge is to find the TD problems that developers may not know about, but should care about. It is easy to point out that File XY has a method that is 250 lines long. But chances are the devs know about this already and either don’t care or don’t know how to fix it. It is much harder (but more useful) to identify where the unknown unknowns are in a code base.
Understanding Code
AI tools have been quite useful for me in figuring out what is going on in a strange codebase. Because they have such an extensive training set, they are able to relate what I am looking at to similar examples, e.g., in other languages or domains. After all, the idea of “fetch data, do something to it, and store data” is a pretty common pattern. Thus using GenAI to navigate a complex codebase, in order to detect TD, is a great use of the tool.
In our studies of GenAI for design, we have noticed GenAI tools struggle with local context. Sure, you can point them at the docs for the project, but there seems to be a lot of specialized knowledge that current RAG/Context Engineering/MCP approaches cannot help with. For example, historical tacit knowledge about how we tried to do it a particular way, but could not. Or tradeoffs made for performance or other quality attribute reasons. These design thinking aspects are harder to find in the training data, and consequently less likely to be manifested in the GenAI output. For a long time I have been looking for public sources of architecture decision making, but these are rarely available. Tradeoffs seem to happen tacitly in meeting rooms, or inside corporate internal wikis. As a result, this kind of tradeoff reasoning is largely invisible to GenAI.
Finally, a tradeoff is a decision between two or more Pareto-optimal solutions. GenAI fine-tuning is designed to pick a single outcome. Reinforcement learning, for example, tries to achieve a single best outcome (win the game of chess), not present a set of options. We want our AI to climb the highest hill, not tell us there are several hills to choose from. Consider the navigation function in mapping apps. They present several routes and ask you to choose between them, precisely because they do not have access to your internal objective function (e.g., that this road is single-lane alternating after 4pm, or that you prefer the longer route with fewer stop lights).
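To make the distinction concrete, here is a toy sketch, with entirely invented option names and scores, of what “presenting the hills” rather than climbing one looks like:

```python
# Toy illustration: candidate designs scored on two qualities (higher is
# better). The options and numbers are invented for this example.
options = {
    "cache-everything": (9, 3),  # (performance, maintainability)
    "no-cache": (3, 9),
    "cache-hot-paths": (7, 7),
    "naive-rewrite": (4, 5),  # dominated by cache-hot-paths
}

def pareto_front(opts):
    """Return the options not dominated by any other option.

    An option is dominated if some other option is at least as good on
    every axis and different (hence strictly better on at least one).
    """
    front = {}
    for name, score in opts.items():
        dominated = any(
            other != score and all(o >= s for o, s in zip(other, score))
            for other in opts.values()
        )
        if not dominated:
            front[name] = score
    return front

print(pareto_front(options))
# Keeps three options: cache-everything, no-cache, and cache-hot-paths.
# A single-objective optimizer would hand you one of them; the tradeoff
# discussion is precisely the choice among all three.
```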
The Road Ahead
While I think it is too soon to tell if people will still be needed to write code, I don’t see GenAI eliminating TD as a problem any time soon. In the short term, all that vibe-coded software will need someone to maintain it.[3] And while GenAI will help us better understand these codebases, I’m skeptical it will be able to properly perform engineering tradeoff analysis. That is something we are actively researching – contact me if you want to help out.
I’m a big believer in the socially constructed nature of software. Too many software problems that I see are the result of human factors, such as power politics or management priorities. It is rare that purely technical problems are to blame. Thus[4] I do not see GenAI removing the need for teams of humans to figure out what to build, what qualities it needs to adhere to, and how to keep it working.
Footnotes
1. To be clear, this is super impressive, and something I would not have said was possible even 5-6 years ago.
2. An example of a shortcut AI will take is to simply delete test cases.
3. The typical software path is for new projects to appear much simpler than legacy projects, until the new project becomes the legacy.
4. And I’m aware this is a self-serving statement!