Reducing Spread Impact in Clustering Algorithms

20 Feb 2025

Authors:

(1) Andrew Draganov, Aarhus University and All authors contributed equally to this research;

(2) David Saulpic, Université Paris Cité & CNRS;

(3) Chris Schwiegelshohn, Aarhus University.

Table of Links

Abstract and 1 Introduction

2 Preliminaries and Related Work

2.1 On Sampling Strategies

2.2 Other Coreset Strategies

2.3 Coresets for Database Applications

2.4 Quadtree Embeddings

3 Fast-Coresets

4 Reducing the Impact of the Spread

4.1 Computing a crude upper-bound

4.2 From Approximate Solution to Reduced Spread

5 Fast Compression in Practice

5.1 Goal and Scope of the Empirical Analysis

5.2 Experimental Setup

5.3 Evaluating Sampling Strategies

5.4 Streaming Setting and 5.5 Takeaways

6 Conclusion

7 Acknowledgements

8 Proofs, Pseudo-Code, and Extensions and 8.1 Proof of Corollary 3.2

8.2 Reduction of k-means to k-median

8.3 Estimation of the Optimal Cost in a Tree

8.4 Extensions to Algorithm 1

References

4 Reducing the Impact of the Spread

We now show how one can replace the linear time-dependency on log ∆ with a logarithmic one (i.e., log ∆ → log(log ∆)).

Overview of the Approach Without loss of generality, we assume that the smallest pairwise distance is at least 1, and ∆ is a (known) upper-bound on the diameter of the input. To remove the log ∆ dependency, we proceed by producing a substitute dataset P ′ that has a lower-bound on its minimum distance and an upper-bound on its maximum distance. We then show that, with overwhelming likelihood, reasonable solutions on P ′ have the same cost as solutions on P up to an additive error.

In order to produce the substitute dataset P ′ , we first find a crude upper-bound on the cost of the optimal solution to the clustering task on P. We then create a grid such that, for every cluster in the optimal solution, the cluster is entirely contained in a grid cell with overwhelming likelihood – in some sense, points in different grid cell do not interact. We can then freely move these grid cells around without affecting the cost of the solution, as long as they stay far enough away so that points in different cells still do not interact. Thus, we can constrain the diameter of P ′ to be small with respect to the quality of the approximate solution. We show later how to constrain the minimum distance to be comparable to the quality of this approximate solution as well.

We will focus this section on the simpler k-median problem but show how to reduce k-means to this case in Section 8.2.

4.1 Computing a crude upper-bound

As described, we start by computing an approximate solution U such that OPT ≤ U ≤ poly(n)· OPT. For this, the first step is to embed the input into a quadtree. This embedding has two key properties. First, distances are preserved up to a multiplicative factor O(d log ∆), and therefore the k-median cost is preserved up to this factor as well. Second, the metric is a hierarchically separated tree: it can be represented with a tree, where points of P are the leafs. The distance between two points is then given by the depth of their lowest common ancestor – if it is at depth ℓ, their distance is √ d∆2−ℓ . Our first lemma shows that finding the first level of the tree for which the input lies in k + 1 disjoint subtrees provides us with the desired approximation.

We prove this in Section 8.3. A direct consequence is that the first level of the tree for which at least k + 1 cells are non empty allows us to compute an O(n)-approximation for k-median on the tree metric. Since the tree metric approximates the original Euclidean metric up to O(d log ∆), this is therefore an O(nd log ∆)-approximation to k-median in the Euclidean space.

To turn this observation into an algorithm, one needs to count the number of non-empty cells at a given level ℓ: for each point, we identify the cell that contains it using modulo operations. Furthermore, we count the number of distinct non-empty cells using a standard dictionary data structure. This is done in time O˜(nd), and pseudo-code is given Algorithm 2. Using a binary search on the O(log ∆) many levels then gives the following result:

Given this crude approximate solution, it remains to create a substitute dataset P ′ that satisfies two requirements:

1 First, P ′ must have spread linear in the quality of the approximate solution. If this holds, Algorithm 1 on P ′ will take O˜(nd log log ∆) time.

2 Second, any reasonable solution on P ′ should be roughly equivalent to a corresponding solution on P. This would mean that running Algorithm 1 on P ′ gives us a valid coreset for P.

4.2 From Approximate Solution to Reduced Spread

Proposition 4.4. Let P ′ be the dataset produced by Algorithm 3. It holds that:

Combining the algorithm of Lemma 4.2, which gives a bound on U, with Lemma 4.5 brings us to the following theorem:

To summarize, we have shown that one can, in O˜(nd) time, find a modified dataset P ′ that preserves the cost of corresponding k-means and k-median solutions from P. Importantly, this P ′ has spread that is logarithmic in the spread of P. As a result, one can apply Algorithm 1 onto P ′ in O˜(nd) time without compromising the compression quality with respect to P. Lastly, this compression on P ′ can be re-formatted onto P in O˜(nd) time.

This paper is available on arxiv under CC BY 4.0 DEED license.

← Previous

How to Make Big Data Clustering Faster and More Efficient

Up Next →

What Is the Best Way To Compress Big Data Without Sacrificing Accuracy?