Clustering is a central problem in unsupervised machine learning (ML) with many applications across domains in both industry and academic research more broadly. At its core, clustering consists of the following problem: given a set of data elements, the goal is to partition them into groups such that similar objects are in the same group, while dissimilar objects are in different groups. This problem has been studied in mathematics, computer science, operations research, and statistics for more than 60 years in its many variants. Two common forms of clustering are metric clustering, in which the elements are points in a metric space, as in the k-means problem, and graph clustering, where the elements are nodes of a graph whose edges represent the similarity among them.
|In the k-means clustering problem, we are given a set of points in a metric space, and the goal is to identify k representative points, called centers (illustrated here as triangles), so as to minimize the sum of the squared distances from each point to its closest center. Source, rights: CC-BY-SA-4.0|
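The k-means objective described in the caption can be written down in a few lines. The following is a minimal, non-private sketch in plain Python; the function names `squared_distance` and `kmeans_cost` are illustrative and do not come from any released code.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# Two well-separated groups of two points each, with one center per group.
points = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
centers = [[0.0, 1.0], [10.0, 1.0]]
print(kmeans_cost(points, centers))  # every point is at squared distance 1 from its center
```

A k-means algorithm searches for the k centers that make this cost as small as possible; the sketch only evaluates the cost of a given set of centers.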
Despite the extensive literature on the design of clustering algorithms, few practical works have focused on strictly protecting user privacy during clustering. When clustering is applied to private data (such as user queries), it is important to consider the privacy implications of using clustering solutions in a real system and how much information the output solution reveals about the input data.
One way to ensure privacy in a rigorous sense is to develop differentially private (DP) clustering algorithms. These algorithms ensure that the output of the clustering does not reveal private information about a specific data element (such as whether a user has issued a given query) or sensitive data about the input graph (such as a relationship in a social network). Given the importance of privacy protections in unsupervised machine learning, in recent years Google has invested in research on the theory and practice of differentially private metric and graph clustering, and on differential privacy in a variety of contexts, such as heatmaps or tools for designing DP algorithms.
Today we are excited to announce two important updates: 1) a new differentially private algorithm for hierarchical graph clustering, which we will present at ICML 2023, and 2) the open-source release of the code of a scalable differentially private k-means algorithm. This code brings differentially private k-means clustering to large-scale datasets using distributed computing. Here, we will also discuss our work on clustering technology for a recent deployment in the health domain to inform public health authorities.
Differentially private hierarchical clustering
Hierarchical clustering is a popular clustering approach that consists of recursively partitioning a dataset into clusters at an increasingly finer granularity. A well-known example of hierarchical clustering in biology is the phylogenetic tree, in which all life on Earth is partitioned into finer and finer groups (e.g., kingdom, phylum, class, order, etc.). A hierarchical clustering algorithm takes as input a graph representing the similarity of entities and learns such recursive partitions in an unsupervised way. However, at the time of our research, no algorithm was known that could compute a hierarchical clustering of a graph with edge privacy, i.e., while preserving the privacy of the vertex interactions.
In “Differentially Private Hierarchical Clustering with Provable Approximation Guarantees”, we consider how well the problem can be approximated in the DP context and establish upper and lower bounds on the privacy guarantee. We design an approximation algorithm (the first of its kind) with a polynomial running time that achieves both an additive error that scales with the number of nodes n (of order n^2.5) and a multiplicative approximation of O(log^(1/2) n), with the multiplicative error identical to that of the non-private setting. We additionally provide a new lower bound on the additive error (of order n^2) for any private algorithm (irrespective of its running time) and provide an exponential-time algorithm that matches this lower bound. Moreover, our paper includes a beyond-worst-case analysis focusing on the hierarchical stochastic block model, a standard random graph model that exhibits a natural hierarchical clustering structure, and introduces a private algorithm that returns a solution with an additive cost over the optimum that is negligible for larger and larger graphs, again matching the non-private state-of-the-art approaches. We believe that this work expands the understanding of privacy-preserving algorithms on graph data and will enable new applications in such settings.
Large-scale differentially private clustering
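Read as a formula, the guarantee of the polynomial-time algorithm combines a multiplicative and an additive term. The sketch below assumes a minimization objective and suppresses any dependence on the privacy parameters; the symbols T_ALG (the hierarchy returned by the algorithm) and T* (an optimal hierarchy) are notation introduced here for illustration.

```latex
\[
  \mathrm{cost}(T_{\mathrm{ALG}})
  \;\le\;
  O\!\left(\sqrt{\log n}\right)\cdot \mathrm{cost}(T^{*})
  \;+\;
  O\!\left(n^{2.5}\right)
\]
```

Under the same notation, the lower bound says that the additive term of any private algorithm must be of order n^2, so the remaining gap between the polynomial-time upper bound and the lower bound is a factor of order n^0.5.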
We now switch gears and discuss our work on metric space clustering. Most prior work in DP metric clustering has focused on improving the approximation guarantees of the algorithms for the k-means objective, leaving scalability questions out of the picture. Indeed, it is not clear how efficient non-private algorithms such as k-means++ or k-means// can be made differentially private without significantly sacrificing either the approximation guarantees or the scalability. On the other hand, both scalability and privacy are of primary importance at Google. For this reason, we recently published several papers that address the problem of designing efficient differentially private clustering algorithms that can scale to massive datasets. Our goal is, moreover, to offer scalability to large input datasets, even when the target number of centers, k, is large.
We work in the massively parallel computation (MPC) model, which is a computation model representative of modern distributed computing architectures. The model consists of several machines, each holding only part of the input data, that work together to solve a global problem while minimizing the amount of communication between machines. We present a differentially private constant-factor approximation algorithm for k-means that requires only a constant number of rounds of synchronization. Our algorithm builds upon our previous work on the problem (with code available here), which was the first differentially private clustering algorithm with provable approximation guarantees that could work in the MPC model.
The DP constant-factor approximation algorithm drastically improves on the previous work by using a two-phase approach. In the initial phase, it computes a crude approximation to “seed” the second phase, which consists of a more sophisticated distributed algorithm. Equipped with the first-phase approximation, the second phase relies on results from the coreset literature to subsample a relevant set of input points and find a good differentially private clustering solution for those points. We then prove that this solution generalizes with approximately the same guarantee to the entire input.
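To make the MPC setting above more concrete, here is a toy simulation of a single synchronization round: each "machine" summarizes its own shard, and a coordinator merges the summaries, adds Gaussian noise, and recomputes the centers. This is a generic noisy Lloyd-style update, not the two-phase algorithm described in this post. The names `local_summary`, `noisy_update`, and `noise_scale` are hypothetical, and `noise_scale` is not calibrated to any privacy budget; a real DP algorithm would need bounded inputs and careful privacy accounting.

```python
import random


def local_summary(shard, centers):
    """Runs on one machine: per-center coordinate sums and point counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in shard:
        # Index of the center closest to p in squared Euclidean distance.
        nearest = min(
            range(len(centers)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += p[d]
    return sums, counts


def noisy_update(summaries, centers, noise_scale, rng):
    """Runs on the coordinator: merge summaries, add noise, recompute centers."""
    dim = len(centers[0])
    new_centers = []
    for j, old in enumerate(centers):
        count = sum(s[1][j] for s in summaries) + rng.gauss(0.0, noise_scale)
        total = [
            sum(s[0][j][d] for s in summaries) + rng.gauss(0.0, noise_scale)
            for d in range(dim)
        ]
        # Keep the old center when the noisy count is too small to trust.
        if count >= 1.0:
            new_centers.append([t / count for t in total])
        else:
            new_centers.append(list(old))
    return new_centers


# Two machines, each holding one shard of the input.
shards = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 0.0], [10.0, 2.0]]]
centers = [[1.0, 1.0], [9.0, 1.0]]
summaries = [local_summary(shard, centers) for shard in shards]  # one round
print(noisy_update(summaries, centers, noise_scale=0.1, rng=random.Random(0)))
```

Only the small per-center summaries cross machine boundaries, never the raw points, which is what keeps the communication per round low in this sketch.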
Vaccination search insights via DP clustering
We then apply these advances in differentially private clustering to real-world applications. One example is our application of our differentially private clustering solution to publishing COVID vaccine-related queries, while providing strong privacy protections for the users.
Vaccination Search Insights (VSI) aims to help public health decision makers (health authorities, government agencies, and non-profit organizations) identify and respond to communities’ information needs regarding COVID vaccines. To achieve this, the tool allows users to explore the top topics searched by people in different geographic areas (zip code, county, and state level in the US) related to COVID queries. Specifically, the tool visualizes statistics on trending queries that are rising in interest in a given location and time.
|Screenshot of the tool output. On the left, the top searches related to COVID vaccines during the period October 10-16, 2022. On the right, the queries that rose in importance during the same period compared to the previous week.|
To better identify trending search topics, the tool clusters search queries based on their semantic similarity. This is done by running a custom-designed k-means-based algorithm on search data that has been anonymized using the DP Gaussian mechanism to add noise and remove low-count queries (thus resulting in a differentially private clustering). The method provides strong differential privacy guarantees for the protection of user data.
This tool provided fine-grained data on the perception of the COVID vaccine in the population at unprecedented scales of granularity, which is especially relevant for understanding the needs of marginalized communities disproportionately affected by COVID. This project highlights the impact of our investment in research on differential privacy and unsupervised ML methods. We are looking at other important areas where we can apply these clustering techniques to help guide decision making around global health challenges, such as search queries on climate change-related challenges like air quality or extreme heat.
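The anonymization step described above, adding Gaussian noise to query counts and dropping queries whose counts are low, can be sketched as follows. This is a simplified illustration, not the production pipeline: it assumes each user contributes at most one unit to a single query count (L2 sensitivity 1), uses the classic Gaussian-mechanism calibration that holds for 0 < epsilon < 1, and takes the cutoff `threshold` as a hypothetical parameter that is not derived from the privacy budget here.

```python
import math
import random


def gaussian_sigma(epsilon, delta, l2_sensitivity=1.0):
    """Classic Gaussian-mechanism noise scale (valid for 0 < epsilon < 1)."""
    if not 0.0 < epsilon < 1.0 or not 0.0 < delta < 1.0:
        raise ValueError("this calibration assumes 0 < epsilon < 1 and 0 < delta < 1")
    return l2_sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


def noisy_thresholded_counts(counts, epsilon, delta, threshold, rng=None):
    """Add Gaussian noise to every count and keep only queries above the threshold."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(epsilon, delta)
    released = {}
    for query, count in counts.items():
        noisy = count + rng.gauss(0.0, sigma)
        if noisy >= threshold:  # low-count queries are removed
            released[query] = noisy
    return released


# Hypothetical query counts, for illustration only.
counts = {"vaccine near me": 10000, "very rare query": 1}
released = noisy_thresholded_counts(
    counts, epsilon=0.5, delta=1e-5, threshold=100.0, rng=random.Random(42)
)
print(sorted(released))
```

Only the noisy counts of the surviving queries would then be passed on to the clustering step, so the clustering never sees exact per-query counts.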
We thank our co-authors Silvio Lattanzi, Vahab Mirrokni, Andres Munoz Medina, Shyam Narayanan, David Saulpic, Chris Schwiegelshohn, Sergei Vassilvitskii, and Peilin Zhong, and our colleagues from the Health AI team who made VSI possible: Shailesh Bavadekar, Adam Boulanger, Tague Griffith, Mansi Kansal, Chaitanya Kamath, Akim Kumok, Yael Mayer, Tomer Shekel, Megan Shum, Charlotte Stanton, Mimi Sun, Swapnil Vispute, and Mark Young.
For more information about the Graph Mining team (part of Algorithms and Optimization), visit our pages.