Bug Prediction at Google


What’s the problem?

Here at Google, we have thousands of engineers working on our code base every day. In fact, as previously noted, 50% of the Google code base changes every month. That’s a lot of code and a lot of people. In order to ensure that our code base stays healthy, Google primarily employs unit testing and code review for all new check-ins. When a piece of code is ready for submission, not only should all the current tests pass, but new tests should also be written for any new functionality. Once the tests are green, the code reviewer swoops in to make sure that the code is doing what it is supposed to, and stamps the legendary “LGTM” (Looks Good To Me) on the submission, and the code can be checked in.

However, Googlers work every day on increasingly complex problems, providing the features and availability that our users depend on. Some of these problems are necessarily difficult to grapple with, leading to code that is unavoidably difficult. Sometimes, that code works very well, and is deployed without incident. Other times, the code creates issues again and again, as developers try to wrestle with the problem. For the sake of this article, we’ll call this second class of code “hot spots”. Perhaps a hot spot is resistant to unit testing, or maybe a very specific set of conditions can lead the code to fail. Usually, our diligent, experienced, and fearless code reviewers are able to spot any issues and resolve them. That said, we’re all human, and sneaky bugs are still able to creep in. We found that it can be difficult to realize when someone is changing a hot spot versus generally harmless code. Additionally, as Google’s code base and teams increase in size, it becomes less likely that the submitter and reviewer will even be aware that they’re changing a hot spot.

In order to help identify these hot spots and warn developers, we looked at bug prediction. Bug prediction uses machine learning and statistical analysis to try to guess whether a piece of code is potentially buggy or not, usually within some confidence range. Source-based metrics that could be used for prediction include the number of lines of code, the number of dependencies, and whether those dependencies are cyclic. These can work well, but they will flag our necessarily difficult, but otherwise innocuous, code as well as our hot spots. We’re only worried about our hot spots, so how do we find only them? Well, we actually have a great, authoritative record of where code has been requiring fixes: our bug tracker and our source control commit log! The research (for example, FixCache) indicates that predicting bugs from the source history works very well, so we decided to deploy it at Google.


How it works
In the literature, Rahman et al. found that a very cheap algorithm actually performs almost as well as some very expensive bug-prediction algorithms. They found that simply ranking files by the number of times they’ve been changed with a bug-fixing commit (i.e. a commit which fixes a bug) will find the hot spots in a code base. Simple! This matches our intuition: if a file keeps requiring bug-fixes, it must be a hot spot because developers are clearly struggling with it.
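As a minimal sketch of that ranking, assume each bug-fixing commit is available as a list of the file paths it touched. The function name and the sample paths below are hypothetical and only illustrate the counting idea; they are not taken from our tooling.

```python
from collections import Counter


def rank_by_fix_count(bug_fix_commits):
    """Rank files by how many bug-fixing commits touched them.

    bug_fix_commits: iterable of iterables of file paths, one per commit.
    Returns (path, count) pairs, most-fixed first; ties are broken by path.
    """
    counts = Counter()
    for files in bug_fix_commits:
        # Count each file once per commit, even if it is listed twice.
        counts.update(set(files))
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical history: three bug-fixing commits.
example_commits = [
    ["net/socket.cc", "net/dns.cc"],
    ["net/socket.cc"],
    ["ui/button.cc", "net/socket.cc"],
]
ranking = rank_by_fix_count(example_commits)
```

Here `net/socket.cc` comes out on top because it appears in all three bug-fixing commits.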

Aside from the speed of execution, this algorithm is also very attractive as it’s easy to communicate to others: files are flagged if they have attracted a large number of bug-fixing commits, no more and no less. Some bug prediction algorithms use a large number of metrics and perform many calculations before they output a result, but how do we know it’s not a false positive? We don’t! Once developers start feeling a tool is lying to them, they’ll quickly stop using it. With the Rahman algorithm, whether a developer agrees with the prediction or not is up for debate, but no one can argue with the actual number it outputs.

We implemented the Rahman algorithm by creating a program that hooks into our source control system and pulls out all the changes that had a bug attached to them. It looks at each bug number, verifies with the bug-tracking database that it was really a bug, and filters out everything else, such as feature requests. It then looks at all the files that appeared in these changes, and filters out those that have been deleted and are no longer at HEAD. For each file, it calculates the number of bug-fixing changes the file has appeared in, and we output the files ranked in the top 10%.
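The steps in that pipeline can be sketched roughly as follows. Everything here is an assumption made for illustration: the `Change` record, the `issue_type` field standing in for the bug-tracker lookup, the `files_at_head` set standing in for the source control query, and the choice to take the top 10% of the files that appeared in at least one bug-fixing change.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """Hypothetical record of one change that had an issue attached."""
    files: tuple      # paths touched by the change
    issue_type: str   # type recorded in the tracker, e.g. "bug" or "feature"


def top_hot_spots(changes, files_at_head, percent=10):
    """Return the top `percent` of files, ranked by bug-fixing change count."""
    counts = Counter()
    for change in changes:
        # Keep only changes whose attached issue really is a bug.
        if change.issue_type != "bug":
            continue
        for path in set(change.files):
            # Skip files that have been deleted and are no longer at HEAD.
            if path in files_at_head:
                counts[path] += 1
    ranked = sorted(counts, key=lambda path: (-counts[path], path))
    # Integer ceiling of len(ranked) * percent / 100, so a small list
    # still yields at least one file.
    keep = (len(ranked) * percent + 99) // 100
    return ranked[:keep]


# Hypothetical input data.
changes = [
    Change(("a.cc", "b.cc"), "bug"),
    Change(("a.cc",), "bug"),
    Change(("c.cc",), "feature"),
    Change(("old.cc", "a.cc"), "bug"),
]
files_at_head = {"a.cc", "b.cc", "c.cc"}
hot_spots = top_hot_spots(changes, files_at_head)
```

In this toy history, `c.cc` is ignored because its change was a feature request, and `old.cc` is ignored because it no longer exists at HEAD.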

We showed output to the development teams (you know, just to make sure). The response?

“Hey guys, this list looks great, but there’s a couple of files that used to be a problem, but we fixed them, so they shouldn’t be on here now.”

It turns out that while the Rahman algorithm shows us where hot spots are, it doesn’t adapt to changes readily. If a development team manages to nail down a hot spot and get it fixed, it’ll still appear in the list because of all the bug-fixing commits it created in the past.

What we needed was a way of prioritizing newer bug-fixing commits, and downgrading the value of old ones, so fixed files begin to fall down the list.

After some trial and error, we decided to score each file by weighting each bug-fixing commit by how old it is. As a commit gets older, its influence tends towards 0.


$$\text{Score} = \sum_{i=1}^{n} \frac{1}{1 + e^{-12 t_i + 12}}$$

Where n is the number of bug-fixing commits, and t_i is the timestamp of the bug-fixing commit represented by i. The timestamp used in the equation is normalized from 0 to 1, where 0 is the earliest point in the code base, and 1 is now (where now is when the algorithm was run). Note that the score changes over time with this algorithm due to the moving normalization; it’s not meant to provide some objective score, only provide a means of comparison between one file and another at any one point in time.
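A minimal sketch of this scoring, assuming each bug-fixing commit contributes a logistic weight of 1 / (1 + e^(-12t + 12)), where t is its normalized timestamp. The function names and the day-based timestamps in the example are hypothetical.

```python
import math


def normalize(timestamp, first, now):
    """Map a raw timestamp onto [0, 1]: 0 is the earliest point, 1 is now."""
    if now <= first:
        raise ValueError("now must be later than the first timestamp")
    return (timestamp - first) / (now - first)


def hot_spot_score(fix_timestamps, first, now):
    """Sum the age-weighted contributions of a file's bug-fixing commits."""
    return sum(
        1.0 / (1.0 + math.exp(-12.0 * normalize(t, first, now) + 12.0))
        for t in fix_timestamps
    )


# Hypothetical code base whose history spans days 0 to 1000.
# One file had ten fixes long ago; the other had two fixes very recently.
formerly_troubled = hot_spot_score(range(100, 200, 10), first=0, now=1000)
recently_troubled = hot_spot_score([950, 1000], first=0, now=1000)
```

Under these assumptions the file with two recent fixes scores far higher than the file with ten old ones, which is the behavior the development teams asked for.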

Some of you might wonder why we don’t factor in the number of commits in the algorithm: a file that changes often as it’s being developed will get more bug-fixing commits. Wouldn’t it be fairer to look at the ratio of non-fixing commits to bug-fixing commits? Having trialled this, we found the results unsatisfying. Code churn has previously been pointed at as an indicator of the presence of defects (particularly by Nagappan and Ball), so employing a ratio removes that useful signal.

If we plot our equation, it looks like this:

[Figure: the weight of a bug-fixing commit plotted against its normalized timestamp]

Using this scoring algorithm means that as commits get older, they are worth less and less. The drop-off happens quickly in order to really push up those newer bug-fixing changes and devalue the older ones. Files that don’t get many bug-fixing commits for a while will end up falling out of the top 10%.
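To put rough numbers on that drop-off, here is a worked evaluation. It takes the weight of a single bug-fixing commit to be the logistic term w(t) = 1 / (1 + e^(-12t + 12)), with t normalized from 0 to 1; the sample values of t are chosen only for illustration.

$$
\begin{aligned}
w(1.0) &= \frac{1}{1 + e^{0}} = 0.5 \\
w(0.9) &= \frac{1}{1 + e^{1.2}} \approx 0.231 \\
w(0.75) &= \frac{1}{1 + e^{3}} \approx 0.047 \\
w(0.5) &= \frac{1}{1 + e^{6}} \approx 0.0025 \\
w(0) &= \frac{1}{1 + e^{12}} \approx 6.1 \times 10^{-6}
\end{aligned}
$$

Under that assumption, a fix made three quarters of the way through the code base's history counts for roughly a tenth of a fix made today, and a fix from the midpoint counts for about one two-hundredth.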


How we’re using it
When a file is predicted to be a hot spot, we place a warning in our code review system on that file. Whenever a reviewer logs in to review that code, the warning will appear, which hopefully will encourage them to spend some more time reviewing the code, or hand off the review to someone more experienced if need be.
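Purely as an illustration of that lookup, and not a description of our code review system, the warning step might be sketched like this. The function name and the message text are hypothetical.

```python
HOT_SPOT_WARNING = (
    "This file is a predicted hot spot; consider reviewing it with extra care."
)


def review_warnings(changed_files, hot_spots):
    """Map each changed file that is a predicted hot spot to a warning."""
    return {
        path: HOT_SPOT_WARNING
        for path in changed_files
        if path in hot_spots
    }


# Hypothetical change touching two files, one of which is a hot spot.
warnings = review_warnings(["net/socket.cc", "ui/button.cc"], {"net/socket.cc"})
```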


Conclusion
Bug prediction is not an objective measure by any means: the attentive amongst you will see it’s another tool that we can provide Googlers with in order to help them gain insight into their code. We hope that by highlighting code hot spots, we’ll help to stop tricky bugs making their way into the code base. We’ll be monitoring how developers are engaging with these reviews in the months to come.


– Chris Lewis and Rong Ou




