Bayesian Hierarchical Modeling in Software Engineering
At MSR 2018 in Gothenburg, I presented my work on using Bayesian inference to set software metrics thresholds. We want thresholds because for many software metrics, such as coupling between objects (CBO), a single, global threshold (“all software objects with this value or below are maintainable”) is nonsensical, if only because programming language choice matters. So we want to tailor the threshold to some relevant context (e.g., perhaps all Java code should have a CBO of X or less). The question I answered is how we do that tailoring, given some contextual features.
In this case, the contextual feature I looked at was architectural role: Java files categorized by the role they play in the Spring framework, using data derived from a paper by Mauricio Aniche and others.
The bottom line of this new approach is that we can use Bayesian inference and hierarchical models to perform a simple regression and get a 50% drop in root mean squared error (RMSE).
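For reference, RMSE is just the square root of the mean squared difference between observed and predicted metric values. A minimal sketch of how such a comparison might be computed, with hypothetical array names that are not from the paper:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted metric values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Illustrative comparison on held-out data (names are hypothetical):
# rmse(y_test, pooled_predictions) vs. rmse(y_test, hierarchical_predictions)
```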
The more interesting conclusion, from a methodology point of view, is that hierarchical modeling with Bayesian inference fits software engineering data very well and is straightforward to express in modern probabilistic programming languages. I followed a similar approach to the one detailed in this blog post on hierarchical modeling with PyStan. There are two key ideas.
- Use global data as regularization over the detailed, local model. In this case, the global data come from all the different Java projects, while the local model is the specific coupling metrics for one particular project. The effect is to allow each individual project’s slope and intercept to vary by an amount dictated by the global values.
- We model this using a Bayesian approach, which means we condition on the data we observe and combine it with a prior to estimate a posterior distribution. I really like this approach because it forces you to think about your prior (what should the metric’s distribution look like?), and because it produces a posterior distribution rather than a single point estimate. A distribution is much more flexible for making inferences than a point estimate (e.g., we could set the threshold at the value below which 75% of the probability mass lies). A sketch of such a model follows this list.
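To make those two ideas concrete, here is a minimal sketch of what a varying-intercepts, varying-slopes model might look like in PyStan (2.x-style API). It is not the model from the paper: the priors, variable names, synthetic data, and the choice of a size measure as the predictor are all illustrative assumptions.

```python
# A minimal sketch of the partial-pooling idea in PyStan. This is NOT the
# exact model from the paper: priors, names, and data are illustrative.
import numpy as np
import pystan

model_code = """
data {
  int<lower=1> N;                  // number of files
  int<lower=1> J;                  // number of groups (projects or roles)
  int<lower=1, upper=J> group[N];  // group index for each file
  vector[N] x;                     // predictor, e.g. a size measure
  vector[N] y;                     // response, e.g. CBO
}
parameters {
  real mu_alpha;                   // global intercept
  real mu_beta;                    // global slope
  real<lower=0> sigma_alpha;       // how much intercepts vary across groups
  real<lower=0> sigma_beta;        // how much slopes vary across groups
  vector[J] alpha;                 // per-group intercepts
  vector[J] beta;                  // per-group slopes
  real<lower=0> sigma;             // residual scale
}
model {
  // Weakly informative global priors: this is where you state what you
  // expect the metric's distribution to look like before seeing the data.
  mu_alpha ~ normal(0, 10);
  mu_beta ~ normal(0, 10);
  sigma_alpha ~ cauchy(0, 2.5);
  sigma_beta ~ cauchy(0, 2.5);
  sigma ~ cauchy(0, 2.5);

  // Partial pooling: each group's intercept and slope are drawn from the
  // global distribution, so the global data regularizes the local models.
  alpha ~ normal(mu_alpha, sigma_alpha);
  beta ~ normal(mu_beta, sigma_beta);

  // Likelihood, conditioned on the observed data.
  y ~ normal(alpha[group] + beta[group] .* x, sigma);
}
"""

# Synthetic data just to make the sketch runnable: 3 groups, 60 files.
rng = np.random.default_rng(0)
J = 3
group = rng.integers(1, J + 1, size=60)   # Stan indices are 1-based
x = rng.normal(5.0, 1.0, size=60)
y = 2.0 + 0.5 * x + rng.normal(0.0, 1.0, size=60)
data = {"N": len(y), "J": J, "group": group, "x": x, "y": y}

sm = pystan.StanModel(model_code=model_code)
fit = sm.sampling(data=data, iter=2000, chains=4)
samples = fit.extract()

# Because we get a full posterior rather than a point estimate, a threshold
# can be read directly off the posterior predictive distribution: e.g., for
# group j and a file with predictor value x0, take the value below which
# 75% of the probability mass lies.
j, x0 = 0, 5.0
pred = (samples["alpha"][:, j] + samples["beta"][:, j] * x0
        + rng.normal(0.0, samples["sigma"]))
threshold = np.percentile(pred, 75)
print(f"Suggested threshold for group {j}: {threshold:.2f}")
```

The partial pooling is what delivers the regularization: groups with little data are pulled toward the global estimates, while groups with plenty of data are free to deviate from them.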
This was also a fun project to do from an open science perspective. I kept a Jupyter notebook throughout the project, and both the notebook and the paper/presentation are available.