Authors:
(1) Andrew Draganov, Aarhus University (all authors contributed equally to this research);
(2) David Saulpic, Université Paris Cité & CNRS;
(3) Chris Schwiegelshohn, Aarhus University.
Table of Links
2 Preliminaries and Related Work
2.3 Coresets for Database Applications
4 Reducing the Impact of the Spread
4.1 Computing a crude upper-bound
4.2 From Approximate Solution to Reduced Spread
5 Fast Compression in Practice
5.1 Goal and Scope of the Empirical Analysis
5.3 Evaluating Sampling Strategies
5.4 Streaming Setting and 5.5 Takeaways
8 Proofs, Pseudo-Code, and Extensions and 8.1 Proof of Corollary 3.2
8.2 Reduction of k-means to k-median
8.3 Estimation of the Optimal Cost in a Tree
Abstract
We study the theoretical and practical runtime limits of k-means and k-median clustering on large datasets. Since effectively all clustering methods take longer than the time required to read the dataset, the fastest approach is to quickly compress the data and perform the clustering on the compressed representation. Unfortunately, there is no universal best choice for compressing the number of points – while random sampling runs in sublinear time and coresets provide theoretical guarantees, the former offers no accuracy guarantees while the latter is too slow as the numbers of points and clusters grow. Indeed, it has been conjectured that any sensitivity-based coreset construction requires super-linear time in the dataset size.
We examine this relationship by first showing that there does exist an algorithm that obtains coresets via sensitivity sampling in effectively linear time – within log-factors of the time it takes to read the data. Any approach that significantly improves on this must then resort to practical heuristics, leading us to consider the spectrum of sampling strategies across both real and artificial datasets in the static and streaming settings. Through this, we show the conditions in which coresets are necessary for preserving cluster validity as well as the settings in which faster, cruder sampling strategies are sufficient. As a result, we provide a comprehensive theoretical and practical blueprint for effective clustering regardless of data size. Our code is publicly available at source and has scripts to recreate the experiments.
1 Introduction
The modern data analyst has no shortage of clustering algorithms to choose from but, given the ever-increasing size of relevant datasets, many are often too slow to be practically useful. This is particularly relevant for big-data pipelines, where clustering algorithms are commonly used for compression. The goal is to replace a very large dataset by a smaller, more manageable one for downstream tasks, with the hope that it represents the original input well. Lloyd’s algorithm [49] was introduced for precisely this reason and minimizes the quantization error – the sum of squared distances from each input point to its representative in the compressed dataset. Arguably the most popular clustering algorithm, Lloyd’s runs for multiple iterations until convergence, with every iteration requiring O(ndk) time, where n is the number of points, d is the number of features and k is the number of clusters – or the size of the targeted compression. For such applications, the number of points can easily be hundreds of millions and, since the quality of compression increases with k, standard objectives can have k in the thousands [41, 4]. In such settings, any O(ndk) algorithm is prohibitively slow.
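To make the per-iteration cost concrete, each Lloyd’s iteration assigns every one of the n points to its nearest of the k centers (nk distance computations, each taking d operations) and then recomputes each center as the mean of its assigned points. The following minimal NumPy sketch illustrates one such iteration; the function and variable names are ours, for illustration only:

```python
import numpy as np

def lloyds_iteration(X, centers):
    """One Lloyd's iteration: O(n * d * k) time for n points in d dimensions and k centers."""
    # Assignment step: n x k matrix of squared distances, the O(ndk) bottleneck.
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    # Update step: each center becomes the mean of its assigned points.
    new_centers = np.array([
        X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
        for j in range(len(centers))
    ])
    # Quantization error: sum of squared distances to the nearest center (the k-means objective).
    cost = sq_dists[np.arange(len(X)), labels].sum()
    return new_centers, labels, cost
```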
Examples like these have prompted the rise of big data algorithms that provide both theoretical and practical runtime improvements. The perspectives of theoretical soundness and practical efficacy are, however, often at odds with one another. On the one hand, theoretical guarantees provide assurance that the algorithm will work regardless of whatever unlucky inputs it receives. On the other hand, it may be difficult to convince oneself to implement the theoretically optimal algorithm when there are cruder methods that are faster to get running and perform well in practice.
Since datasets can be large in the number of points n and/or the number of features d, big-data methods must mitigate the effects of both. With respect to the feature space, the question is effectively closed as random projections are fast (running in effectively linear time), practical to implement [50], and provide tight guarantees on the embedding’s size and quality. The outlook is less clear when reducing the number of points n, and there are two separate paradigms that each provide distinct advantages. On the one hand, we have uniform sampling, which runs in sublinear time but may miss important subsets of the data and therefore can only guarantee accuracy under certain strong assumptions on the data [45]. On the other hand, the most accurate sampling strategies provide the strong coreset guarantee, wherein the cost of any solution on the compressed data is within an ε-factor of that solution’s cost on the original dataset [25].
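Written out in our own notation, the strong coreset guarantee for k-means requires that a weighted set Ω with weights w satisfy, for every set C of k centers,

```latex
\Bigl|\, \sum_{q \in \Omega} w_q \min_{c \in C} \lVert q - c \rVert^2
\;-\; \sum_{p \in P} \min_{c \in C} \lVert p - c \rVert^2 \,\Bigr|
\;\le\; \varepsilon \cdot \sum_{p \in P} \min_{c \in C} \lVert p - c \rVert^2 ,
```

where P is the original dataset; for k-median the squared distances are replaced by distances.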
Our contributions We study both paradigms (uniform sampling and strong coresets) with respect to a classic problem – compression for the k-means and k-median objectives. Whereas uniform sampling provides optimal speed but no worst-case accuracy guarantee, all available coreset constructions have a running time of at least Ω̃(nd + nk) when yielding tight bounds on the minimum number of samples required for accurate compression.
It is easy to show that any algorithm that achieves a compression guarantee must read the entire dataset[1]. Thus a clear open question is what guarantees are achievable in linear or nearly-linear time. Indeed, currently available fast sampling algorithms for clustering [6, 5] cannot achieve strong coreset guarantees. Recently, [31] proposed a method for strong coresets that runs in time Õ(nd + nk) and conjectured this to be optimal for k-median and k-means.
While this bound is effectively optimal for small values of k, there are many applications such as computer vision [34] or algorithmic fairness [18] where the number of clusters can be larger than the number of features by several orders of magnitude. In such settings, the question of time-optimal coresets remains open. Since the issue of determining a coreset of optimal size has recently been closed [25, 28, 44], this is arguably the main open problem in coreset research for center-based clustering. We resolve this by showing that there exists an easy-to-implement algorithm that constructs coresets in Õ(nd) time – only logarithmic factors away from the time it takes to read in the dataset.
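For intuition, sensitivity-based coreset constructions sample each point with probability proportional to its sensitivity – its worst-case share of the clustering cost, typically estimated from a crude initial solution – and reweight each sample by the inverse of its sampling probability. The sketch below shows this standard template only; it is not the paper’s specific Õ(nd) construction, and the names and the choice of crude solution are placeholders:

```python
import numpy as np

def sensitivity_sample_coreset(X, crude_centers, m, rng=None):
    """Sample an m-point weighted coreset via (approximate) sensitivity sampling.

    `crude_centers` is any rough k-means solution (e.g. obtained on a small sample);
    sensitivities are estimated from each point's cost share under that solution.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = len(X)
    sq_dists = ((X[:, None, :] - crude_centers[None, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    point_cost = sq_dists[np.arange(n), labels]
    total_cost = point_cost.sum()
    # Standard sensitivity estimate: a point matters if it is expensive relative to
    # the total cost, or if its cluster is small and would be missed by uniform sampling.
    cluster_sizes = np.bincount(labels, minlength=len(crude_centers))
    sens = point_cost / total_cost + 1.0 / cluster_sizes[labels]
    probs = sens / sens.sum()
    idx = rng.choice(n, size=m, replace=True, p=probs)
    # Inverse-probability weights make the coreset's cost an unbiased estimate of the full cost.
    weights = 1.0 / (m * probs[idx])
    return X[idx], weights
```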
Nonetheless, this does not fully illuminate the landscape among sampling algorithms for clustering in practice. Although our algorithm achieves both an optimal runtime and an optimal compression, it is certainly possible that other, cruder methods may be just as viable for all practical purposes. We state this formally in the following question: When are optimal k-means and k-median coresets necessary and what is the practical tradeoff between coreset speed and accuracy?
To answer this, we perform a thorough comparison across the full span of sampling algorithms that are faster than our proposed method. Through this we verify that, while these faster methods are sufficiently accurate on many real-world datasets, there exist data distributions that cause catastrophic failure for each of them. Indeed, these cases can only be avoided with a strong-coreset method. Thus, while many practical settings do not require the full coreset guarantee, one cannot cut corners if one wants to be confident in the compression. We verify that this extends to the streaming paradigm and applies to downstream clustering approaches.
In summary, our contributions are as follows:
• We show that one can obtain strong coresets for k-means and k-median in Õ(nd) time. This resolves a conjecture on the necessary runtime for k-means coresets [31] and is theoretically optimal up to log-factors.
• Through a comprehensive analysis across datasets, tasks, and streaming/non-streaming paradigms, we verify that there is a necessary tradeoff between speed and accuracy among the linear- and sublinear-time sampling methods. This gives the clustering practitioner a blueprint on when to use each compression algorithm for effective clustering results in the fastest possible time.
This paper is available on arxiv under CC BY 4.0 DEED license.