Code-change reviews are an important part of the software development process at scale, and they take a significant amount of the code authors’ and code reviewers’ time. As part of this process, the reviewer inspects the proposed code and asks the author for code changes through comments written in natural language. At Google, we see millions of reviewer comments per year, and authors require an average of ~60 minutes of active shepherding time between sending changes for review and finally submitting the change. In our measurements, the active work time that the code author must spend addressing reviewer comments grows almost linearly with the number of comments. However, with machine learning (ML), we have an opportunity to automate and streamline the code review process, for example by proposing code changes based on a comment’s text.
Today, we describe applying recent advances in large sequence models in a real-world setting to automatically resolve code review comments in the day-to-day development workflow at Google (publication forthcoming). As of today, code-change authors at Google address a substantial number of reviewer comments by applying an ML-suggested edit. We expect this to reduce the time spent on code reviews by hundreds of thousands of hours per year at Google scale. Unsolicited, very positive feedback highlights that ML-suggested code edits increase Googlers’ productivity and allow them to focus on more creative and complex tasks.
Predicting the code edit
We started by training a model that predicts the code edits needed to address reviewer comments. The model is pre-trained on various coding tasks and related developer activities (e.g., renaming a variable, repairing a broken build, editing a file). It is then fine-tuned for this specific task with reviewed code changes, the reviewer comments, and the edits the author performed to address those comments.
|An example of ML-recommended edits that are spread throughout the code.|
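One way to picture a fine-tuning example is as a triple of reviewed code, reviewer comment, and the author's follow-up edit. The sketch below is an illustration only: the `ReviewExample` class, its field names, the serialized input layout, and the choice of a unified diff as the prediction target are all assumptions, since the post does not describe the actual data format.

```python
import difflib
from dataclasses import dataclass


@dataclass
class ReviewExample:
    """Hypothetical container for one fine-tuning example (not Google's format)."""

    file_path: str
    reviewed_code: str   # code state the reviewer commented on
    comment_line: int    # 1-based line the comment is attached to
    comment_text: str    # natural-language reviewer comment
    resolved_code: str   # code after the author addressed the comment

    def model_input(self) -> str:
        """Serialize the comment and the reviewed code into one input string."""
        return (
            f"FILE: {self.file_path}\n"
            f"COMMENT (line {self.comment_line}): {self.comment_text}\n"
            f"CODE:\n{self.reviewed_code}"
        )

    def target_edit(self) -> str:
        """Express the author's edit as a unified diff for the model to predict."""
        diff = difflib.unified_diff(
            self.reviewed_code.splitlines(keepends=True),
            self.resolved_code.splitlines(keepends=True),
            fromfile=self.file_path,
            tofile=self.file_path,
        )
        return "".join(diff)


example = ReviewExample(
    file_path="greeter.py",
    reviewed_code="def greet(n):\n    return 'Hello ' + n\n",
    comment_line=1,
    comment_text="Please use a more descriptive parameter name.",
    resolved_code="def greet(name):\n    return 'Hello ' + name\n",
)
print(example.model_input())
print(example.target_edit())
```

Framing the target as a diff rather than as the full rewritten file keeps the output short when the edit is small, which is one plausible design choice among several.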
Google uses a monorepo, a single repository for all of its software artifacts, which allows our training dataset to include all of the unrestricted code used to build the latest Google software as well as previous versions.
To improve the quality of the model, we iterated on the training dataset. For example, we compared model performance on datasets with a single reviewer comment per file versus datasets with multiple comments per file, and experimented with classifiers to clean up the training data based on a small, curated dataset, in order to choose the model with the best offline precision and recall metrics.
Serving infrastructure and user experience
We designed and implemented the feature on top of the trained model, focusing on overall user experience and developer efficiency. As part of this, we explored different user experience (UX) alternatives through a series of user studies. We then refined the feature based on insights from an internal beta (i.e., a test of the feature while in development), including user feedback (e.g., a “Was this helpful?” button next to the suggested edit).
The final model was calibrated for a target precision of 50%. That is, we tuned the model and the suggestion filtering so that 50% of suggested edits on our evaluation dataset are correct. In general, increasing the target precision reduces the number of suggested edits shown, and decreasing the target precision leads to more incorrect suggested edits. Incorrect suggested edits cost developers time and reduce developers’ trust in the feature. We found that a target precision of 50% provides a good balance.
At a high level, for each new reviewer comment, we generate the model input in the same format used for training, query the model, and generate a suggested code edit. If the model is confident in the prediction and a few additional heuristics are satisfied, we send the suggested edit to downstream systems. The downstream systems, i.e., the code review frontend and the integrated development environment (IDE), show the suggested edits to the user and log user interactions such as preview and apply events. A dedicated pipeline aggregates these logs and generates aggregate insights, such as the overall acceptance rates reported in this blog post.
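The trade-off described above can be illustrated with a small calibration routine. This is a minimal sketch under stated assumptions: it supposes that each evaluation example yields a model confidence and a correct/incorrect label, and that suggestions are shown when confidence meets a threshold. The function name `calibrate_threshold` and the toy scores are hypothetical, and the post does not say how the real calibration was done.

```python
from typing import List, Optional, Tuple


def calibrate_threshold(
    scored: List[Tuple[float, bool]], target_precision: float
) -> Optional[Tuple[float, float, float]]:
    """Pick the lowest confidence threshold that still meets the target.

    `scored` holds (model_confidence, edit_is_correct) pairs from an
    evaluation set. Suggestions with confidence >= threshold are shown.
    Returns (threshold, precision, coverage), or None if no threshold
    reaches the target.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    best = None
    correct = 0
    for shown, (confidence, is_correct) in enumerate(ranked, start=1):
        correct += is_correct
        # Only evaluate a cut-off once all tied confidences are included.
        if shown < len(ranked) and ranked[shown][0] == confidence:
            continue
        precision = correct / shown
        if precision >= target_precision:
            # Later iterations have lower thresholds, so keep overwriting.
            best = (confidence, precision, shown / len(ranked))
    return best


# Toy evaluation data (made up for illustration).
scored = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True), (0.60, False),
    (0.50, False), (0.40, True), (0.30, False), (0.20, False), (0.10, False),
]
print(calibrate_threshold(scored, 0.5))  # (0.3, 0.5, 0.8)
print(calibrate_threshold(scored, 0.7))  # (0.7, 0.75, 0.4)
```

On this toy data, raising the target from 0.5 to 0.7 halves the share of comments that receive a suggestion, which mirrors the quantity-versus-quality trade-off in the text.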
|Architecture of the ML-suggested edits infrastructure. We process code and infrastructure from multiple services, get the model predictions, and surface the predictions in the code review tool and IDE.|
The developer interacts with the ML-suggested edits in the code review tool and the IDE. Based on insights from the user studies, the integration into the code review tool is best suited for a streamlined review experience. The IDE integration provides additional functionality and supports 3-way merging of the ML-suggested edits (left in the image below), in case of conflicting local changes on top of the reviewed code state (right), into the merge result (center).
|3-way merge UX in the IDE.|
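The serving flow described above (build the input, query the model, gate on confidence and heuristics, then hand off) can be sketched as follows. Everything here is hypothetical scaffolding for illustration: `Prediction`, `build_model_input`, `suggest_edit`, the `non_empty_edit` check, and the stub model are assumptions, not the actual services or heuristics used at Google.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Prediction:
    edit: str          # suggested code edit, e.g. a diff
    confidence: float  # model confidence in [0, 1]


Heuristic = Callable[[str, Prediction], bool]


def build_model_input(code: str, comment: str) -> str:
    """Format the request like the training data (layout is hypothetical)."""
    return f"COMMENT: {comment}\nCODE:\n{code}"


def suggest_edit(
    code: str,
    comment: str,
    query_model: Callable[[str], Prediction],
    threshold: float,
    heuristics: List[Heuristic],
) -> Optional[Prediction]:
    """Return a suggestion for downstream systems, or None if filtered out."""
    prediction = query_model(build_model_input(code, comment))
    if prediction.confidence < threshold:
        return None  # model is not confident enough
    if not all(check(code, prediction) for check in heuristics):
        return None  # a serving-time heuristic rejected the edit
    return prediction


def non_empty_edit(code: str, prediction: Prediction) -> bool:
    """Example heuristic (an assumption): drop edits that change nothing."""
    return bool(prediction.edit.strip())


def stub_model(model_input: str) -> Prediction:
    """Stand-in for the real model so the sketch runs end to end."""
    return Prediction(edit="-x = 1\n+count = 1\n", confidence=0.8)


shown = suggest_edit("x = 1\n", "Use a clearer name.", stub_model, 0.5, [non_empty_edit])
hidden = suggest_edit("x = 1\n", "Use a clearer name.", stub_model, 0.9, [non_empty_edit])
print(shown, hidden)
```

Keeping the heuristics as a list of independent predicates makes it easy to add a new filter when a recurring failure pattern is found, without touching the model call.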
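To make the idea of 3-way merging concrete, here is a minimal line-based sketch. It assumes the reviewed code state acts as the common base, with the ML-suggested edit and the author's local changes as the two sides. The functions `changed_hunks` and `three_way_merge` are hypothetical and deliberately conservative (changes that touch or overlap are reported as conflicts); this is not the merge algorithm the IDE uses.

```python
import difflib
from typing import List, Optional, Tuple

# (base_start, base_end, replacement_lines)
Hunk = Tuple[int, int, Tuple[str, ...]]


def changed_hunks(base: List[str], other: List[str]) -> List[Hunk]:
    """List the regions of `base` that `other` replaces, deletes, or inserts at."""
    matcher = difflib.SequenceMatcher(a=base, b=other, autojunk=False)
    return [
        (i1, i2, tuple(other[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def three_way_merge(
    base: List[str], suggested: List[str], local: List[str]
) -> Tuple[Optional[List[str]], List[Tuple[Hunk, Hunk]]]:
    """Merge both sides into `base`; return (merged, conflicts)."""
    suggested_hunks = changed_hunks(base, suggested)
    local_hunks = changed_hunks(base, local)

    # Two different hunks conflict if their base ranges overlap or touch.
    conflicts = [
        (sug, loc)
        for sug in suggested_hunks
        for loc in local_hunks
        if sug != loc and sug[0] <= loc[1] and loc[0] <= sug[1]
    ]
    if conflicts:
        return None, conflicts  # leave resolution to the user

    merged = list(base)
    # Identical hunks from both sides collapse in the set union. Apply from
    # the bottom up so earlier indices stay valid.
    all_hunks = set(suggested_hunks) | set(local_hunks)
    for start, end, replacement in sorted(all_hunks, key=lambda h: h[0], reverse=True):
        merged[start:end] = replacement
    return merged, []


base = ["a", "b", "c", "d", "e"]
suggested = ["a", "B", "c", "d", "e"]  # ML-suggested edit changes line 2
local = ["a", "b", "c", "d", "E"]      # author's local change to line 5
print(three_way_merge(base, suggested, local))
```

When both sides modify the same line differently, the function returns `None` together with the conflicting hunk pairs, which is the situation where a merge view would ask the developer to choose.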
Offline evaluations indicate that the model addresses 52% of comments at a target precision of 50%. The online metrics from the beta and the full internal launch confirm these offline metrics: we see model suggestions above our target model confidence for around 50% of all relevant reviewer comments. Between 40% and 50% of all previewed suggested edits are applied by code authors.
We used the “not helpful” feedback during the beta to identify recurring failure patterns of the model. We implemented serving-time heuristics to filter these out and thus reduce the number of incorrect predictions shown. With these changes, we traded quantity for quality and observed an increased real-world acceptance rate.
|Code review tool UX. The recommendation is displayed as part of the comment and can be previewed, applied, and rated as helpful or unhelpful.|
Our beta launch showed a discoverability challenge: code authors previewed only ~20% of all generated suggested edits. We modified the UX and introduced a prominent “Show ML-edit” button (see image above) next to the reviewer’s comment, which led to an overall preview rate of ~40% at launch. In addition, we found that suggested edits in the code review tool were often not applicable due to conflicting changes made by the author during the review process. We addressed this with a button in the code review tool that opens the IDE in a merge view for the suggested edit. We now observe that more than 70% of suggested edits are applied in the code review tool and fewer than 30% in the IDE. All of these changes allowed us to increase the overall fraction of reviewer comments addressed with an ML-suggested edit by a factor of 2 from beta to the full internal launch. At Google’s scale, these results help automate the resolution of hundreds of thousands of comments each year.
|Recommendation filtering funnel.|
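The funnel metrics above (preview rate, apply rate) are the kind of aggregate a log pipeline could compute from interaction events. The following is a minimal sketch, assuming each log record is a `(suggestion_id, event_type)` pair with event types `"shown"`, `"previewed"`, and `"applied"`. The record shape, the `funnel_rates` function, and the toy log are hypothetical and do not reflect the real pipeline or real data.

```python
from typing import Dict, Iterable, Set, Tuple


def funnel_rates(events: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Aggregate per-suggestion interaction events into funnel rates."""
    ids_by_type: Dict[str, Set[str]] = {
        "shown": set(),
        "previewed": set(),
        "applied": set(),
    }
    for suggestion_id, event_type in events:
        if event_type in ids_by_type:  # ignore unknown event types
            ids_by_type[event_type].add(suggestion_id)

    def ratio(numerator: int, denominator: int) -> float:
        return numerator / denominator if denominator else 0.0

    shown = len(ids_by_type["shown"])
    previewed = len(ids_by_type["previewed"])
    applied = len(ids_by_type["applied"])
    return {
        "preview_rate": ratio(previewed, shown),
        "apply_rate_of_previewed": ratio(applied, previewed),
        "overall_apply_rate": ratio(applied, shown),
    }


# Toy log: 10 suggestions shown, 4 previewed, 2 applied (made-up numbers).
events = (
    [(f"s{i}", "shown") for i in range(10)]
    + [(f"s{i}", "previewed") for i in range(4)]
    + [(f"s{i}", "applied") for i in range(2)]
)
print(funnel_rates(events))
```

Counting distinct suggestion IDs per stage, rather than raw events, avoids inflating the rates when a user previews the same suggestion several times.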
We see ML-suggested edits addressing a wide range of reviewer comments in production. This includes simple localized refactorings and refactorings that are spread within the code, as shown in the examples throughout this blog post. The feature also addresses longer and less formally worded comments that require code generation, refactorings, and imports.
|An example of a suggestion for a longer and less formally worded comment that requires code generation, refactorings, and imports.|
The model can also respond to complex comments and produce extensive code edits (shown below). The generated test case follows the existing unit test pattern, while changing the details as described in the comment. In addition, the edit suggests a comprehensive name for the test that reflects the test’s semantics.
|An example of the model’s ability to respond to complex comments and generate extensive code edits.|
Conclusion and future work
In this post, we introduced an ML-assistance feature to reduce the time spent on code review related changes. At the moment, a substantial number of all actionable code review comments in supported languages are addressed with applied ML-suggested edits at Google. A 12-week A/B experiment across all Google developers will further measure the feature’s impact on overall developer productivity.
We are working on improvements throughout the whole stack. This includes increasing the quality and recall of the model and building a more streamlined experience for the developer, with improved discoverability throughout the review process. As part of this, we are investigating the option of showing suggested edits to the reviewer while they draft comments, and expanding the feature into the IDE so that code-change authors can get suggested code edits for natural-language commands.
This is the work of the Google Core Systems & Experiences team, Google Research, and many people at DeepMind. We would like to give special thanks to Peter Choi for putting the collaboration together, and to all our team members for their key contributions and helpful advice, including Marcus Revai, Gabriela Surita, Maxim Tabachnik, Jacob Austin, Nimesh Gelani, Dan Zheng, Peter Josling, Mariana Stariolo, Chris Gorgolewski, Sasha Varkeviser, Katja Grünwedel, Alberto Elizondo, Tobias Welp, Paige Bailey, Pierre-Antoine Manzagol, Pascal Lamblin, Chengji Gu, Petros Maniatis, Henrik Michalewski, Sarah Wiltbergaon, A. Niranjan Tulpule, Zubin Ghahramani, Juanjo Karin, Danny Tarlow, Kevin Villela, Stoyan Nikolov, David Tattersall, Boris Bokovsky, Kathy Nix, Mehdi Gisassi, Louis C. Kobo, Yujia Lee, David Choi, Christoph Molnandar, Weitel, Brett Wiltshire, Laurent Le Brun, Mingpan Guo, Herman Luss, Jonas Matts, Savin Dans.