Fair Clustering & Search


June 14, 2024

Slide Overview

Presenting my work at the Osaka Trustworthy AI workshop.


I'm a DB and AI researcher.



Text on each slide
1.

Fair Clustering & Search
Daichi Amagata, Assistant Professor, Osaka University

2.

My Focus 1: Fairness

Fairness is clearly important, but there are too many definitions… [1,2] Counterfactual fairness? Group fairness? Individual fairness? Some others?
▷ The problem (or application) completely determines which definition matters! (i.e., we do not aim for a "whole cover")

Case 1 (Clustering): Group fairness? Individual fairness? or Size?
Case 2 (Search, a predicate θ over a database X, i.e., {x | x ∈ X, θ(x) = 1}): Group fairness? Ranking fairness? or Equality?

[1] "A Review on Fairness in Machine Learning," ACM CSUR, 2022.
[2] "A Survey on Bias and Fairness in Machine Learning," ACM CSUR, 2021.

3.

My Focus 2: From "big" data to "small" data

This dataset (X) is too large for my application (or resource)… I want to reduce the size to k.
X: a set of n points → S: a set of k (≪ n) points

Case 1 (Clustering): k-center clustering
Case 2 (Search, {x | x ∈ X, θ(x) = 1} over a database X with predicate θ): random sampling

4.

Fair k-center Clustering with Outliers
Daichi Amagata
To appear in AISTATS 2024

5.

Background: Data Summarization

This dataset (X) is too large for my application (or resource)… I want to reduce the size to k.
X: a set of n points → S: a set of k (≪ n) points

• Sampling: quite easy, and the best choice if the data distribution should be preserved (but the summary may be skewed).
• Submodular maximization: users must specify an appropriate function (which may not be trivial).
• Diversity maximization: easy to specify, and the summary can avoid skewness.

6.

Background: k-center Clustering

$S^* = \operatorname{arg\,min}_{S \subseteq P,\, |S| = k}\ \max_{p \in P}\ dist(p, S)$

It's diverse!
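As a point of reference, the classic farthest-first traversal (Gonzalez-style greedy) is the standard 2-approximation for this objective. Below is a minimal Python sketch of that baseline, not the algorithm proposed in this talk; `P` is a point list and `dist` an assumed user-supplied metric:

```python
import random

def gonzalez(P, k, dist):
    """Farthest-first traversal: a classic 2-approximation for k-center
    (no outliers, no fairness constraints)."""
    S = [random.choice(P)]                      # arbitrary first center
    d = [dist(p, S[0]) for p in P]              # distance to nearest center so far
    while len(S) < k:
        i = max(range(len(P)), key=d.__getitem__)   # farthest remaining point
        S.append(P[i])
        d = [min(d[j], dist(P[j], P[i])) for j in range(len(P))]
    return S
```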

7.

Background: k-center Clustering + Outliers

$S^* = \operatorname{arg\,min}_{S \subseteq P \setminus P_{out},\, |S| = k}\ \max_{p \in P \setminus P_{out}}\ dist(p, S)$, where $|P_{out}| \le (1+\epsilon)z$

It's diverse!
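To make the objective concrete, here is a small illustration of my own (hypothetical helper name): evaluating a candidate S under this definition means discarding the (1+ε)z points farthest from S as P_out and taking the max over the rest:

```python
def radius_excluding_outliers(P, S, num_outliers, dist):
    """Objective value of S when the num_outliers farthest points are
    discarded as P_out (num_outliers would be floor((1 + eps) * z))."""
    ds = sorted(min(dist(p, c) for c in S) for p in P)
    kept = ds[: max(0, len(ds) - num_outliers)]   # drop the largest distances
    return kept[-1] if kept else 0.0
```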

8.

Background: k-center Clustering + Outliers cannot be fair

Example: k-center clustering may pick only red centers; make it "FAIR"! (Compare: Google image retrieval for "CEO" used to return only "male" images in the past.)

It's diverse if the centers are drawn from
• the inliers and
• the required demographic groups.

9.

Preliminary

Problem Definition
Input: P (a set of n points), k, k_1, …, k_m (k ≤ Σ k_i), z < n, and ε > 0
Output: $S^* = \operatorname{arg\,min}_{S \subseteq P \setminus P_{out},\ |S| = k,\ \forall i \in [1,m]:\, |S \cap P_i| \le k_i}\ \max_{p \in P \setminus P_{out}}\ dist(p, S)$
(the constraint |S ∩ P_i| ≤ k_i encodes fairness; P \ P_out are the inliers) ▷ NP-hard

First attempt: How about running
• a state-of-the-art (k, (1+ε)z)-center clustering algorithm (e.g., [3])
• with an approach that selects centers only from groups that do not violate the fairness constraint [4]?

Claim (no approximation bound): The clustering quality of the above algorithm can be arbitrarily bad.

[3] "Greedy strategy works for k-center clustering with outliers and coreset construction," In ESA, 2019.
[4] "Fair k-centers via maximum matching," In ICML, 2020.
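The fairness constraint is just a per-group cap on the chosen centers. A tiny sketch (hypothetical helper; `group` is assumed to map each point to its demographic group id, and `caps` holds k_1, …, k_m):

```python
from collections import Counter

def satisfies_fairness(S, group, caps):
    """Check the constraint |S ∩ P_i| <= k_i for every group i."""
    counts = Counter(group[p] for p in S)
    return all(counts[g] <= cap for g, cap in caps.items())
```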

10.

Our Techniques

Procedure
1. S ← a random point ∈ P.
2. T ← {p | p ∈ P, dist(p, S) > 2r} (r is a guess of $r^*_{k,z,m}$).
   If |T| > (1+ε)z, add a random point ∈ T to S; else, break.
   During these iterations, build/update a cluster-group graph (if a group appears in a cluster, the cluster's center has an edge to that group).
3. {(a, i), …, (b, j)} ← solve the max-flow problem on this graph (source s, sink t, capacities k_1, …, k_m on the group side).
4. Replace the cluster centers by using the above result.

Note
• Step 2 guarantees a 2(1+γ)-approximation (without the fairness constraint).
• Step 3 guarantees the fairness constraint.
• Step 4 guarantees a 3(1+γ)-approximation (from the triangle inequality).
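A minimal sketch of Steps 1 and 2 under a single fixed guess r; the cluster-group graph bookkeeping and the max-flow reassignment (Steps 3 and 4) are omitted, and `dist` is an assumed user-supplied metric:

```python
import random

def threshold_greedy(P, k, z, eps, r, dist):
    """Steps 1-2: repeatedly add a random point farther than 2r from the
    current centers, until at most (1+eps)z points remain uncovered."""
    S = [random.choice(P)]
    while len(S) < k:
        T = [p for p in P if min(dist(p, c) for c in S) > 2 * r]
        if len(T) > (1 + eps) * z:
            S.append(random.choice(T))   # a far point; likely an inlier
        else:
            break   # all but at most (1+eps)z points lie within 2r of S
    return S
```

How the guess r is chosen is not shown on this slide; threshold procedures of this kind are typically run over several candidate values of r.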

11.

Theoretical Results

Main theorem (what is possible): There is an O(nk)-time algorithm that needs O(n) space and yields a (3+γ)-approximation result for the fair (k, (1+ε)z)-center clustering problem with probability at least $\left(1 - \frac{z}{n}\right)\left(\frac{\epsilon}{1+\epsilon}\right)^{k-1}$, by returning exactly k centers and removing at most (1+ε)z points.

Proposition (what is impossible): For any S such that |S| ≤ k and S contains at least one outlier, it is not guaranteed that
• we have the worst-case approximation bound w.r.t. $r^*_{k,z,m}$, and
• we can fix S so that it certainly satisfies |S| = k, the fairness constraint, and a guaranteed approximation bound w.r.t. $r^*_{k,z,m}$ in polynomial time if P ≠ NP.

12.

Empirical Results

• [Max radius] Our algorithms yield smaller errors.
• [Running time] Ours-RS is generally faster than the competitors.
• [Impact of k] Our algorithms remain superior to the competitors as k varies.

13.

Independent Range Sampling on Interval Data
Daichi Amagata
To appear in ICDE 2024

14.

Background: Range Search on Interval Data

• An interval x = [x.l, x.r]
• A set of intervals X = {x_1, …, x_n}
• A query interval q = [q.l, q.r] finds all intervals in X that overlap q.

▷ Ex. 1) Vehicle management systems: show vehicles that were active between 17:00 and 22:00 a week ago.
▷ Ex. 2) Book management systems or libraries: collect books that were sold or borrowed in the last month.
▷ Ex. 3) Historical cryptocurrency databases: show when the price of Bitcoin (BTC) fell within [30,000, 40,000] dollars.
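The overlap test used throughout is the standard one; a one-line Python sketch (tuples (l, r) stand in for intervals):

```python
def overlaps(x, q):
    """x = (x.l, x.r) intersects q = (q.l, q.r) iff x.l <= q.r and x.r >= q.l."""
    return x[0] <= q[1] and x[1] >= q[0]
```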

15.

Background: Too Large Result Set

• An interval x = [x.l, x.r]
• A set of intervals X = {x_1, …, x_n}
• A query interval q = [q.l, q.r] finds all intervals that overlap q.

▷ Ex. 1) Vehicle management systems: show vehicles that were active between 17:00 and 22:00 a week ago. On a database X of a billion intervals, the result {x | x ∈ X, x ∩ q ≠ ∅} may still contain 10,000,000 intervals, and reporting it takes Ω(|{x | x ∈ X, x ∩ q ≠ ∅}|) time! What we actually want is a few randomly sampled vehicles.

16.

Preliminary

Problem Definition
Input: X (a set of n intervals), q (a query interval), and s (#samples)
Output: S (a set of s intervals, each of which is picked from q ∩ X uniformly at random)

First attempt: How about
• running a state-of-the-art range search algorithm (e.g., [5])
• and then doing random sampling?

Claim (no merit): The above algorithm incurs Ω(|q ∩ X|) = O(n) time.

[5] "HINT: A hierarchical index for intervals in main memory," In SIGMOD, 2022.
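A sketch of that first attempt, which makes the claim visible: even with a perfect range-search index, materializing q ∩ X already costs Ω(|q ∩ X|) before sampling starts (illustrative code of my own, not the algorithm of [5]):

```python
import random

def search_then_sample(X, q, s):
    """Baseline: full range search, then uniform sampling.
    Materializes every qualifying interval, hence Omega(|q ∩ X|) time."""
    ql, qr = q
    result = [x for x in X if x[0] <= qr and x[1] >= ql]
    return random.sample(result, min(s, len(result)))
```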

17.

Our Techniques: Augmented Interval Tree

Each node maintains the intervals
• overlapping its coordinate, and
• existing in the subtree rooted at this node.

A query performs binary searches on O(n)-sized arrays at most O(log n) times (the height of the tree is ≤ log n).
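A minimal, non-augmented interval-tree sketch to illustrate the structure: each node keeps the intervals that overlap its center coordinate in endpoint-sorted arrays, so a query visits O(log n) nodes and does a binary search at each. The paper's augmented version additionally stores subtree information to support sampling; this sketch only reports overlaps:

```python
import bisect

class Node:
    """One node of a simplified (non-augmented) interval tree."""
    def __init__(self, intervals):
        endpoints = sorted(l for l, r in intervals)
        self.center = endpoints[len(endpoints) // 2]    # split coordinate
        here = [iv for iv in intervals if iv[0] <= self.center <= iv[1]]
        self.by_l = sorted(here)                        # ascending left endpoints
        self.ls = [iv[0] for iv in self.by_l]
        self.by_r = sorted(here, key=lambda iv: iv[1])  # ascending right endpoints
        self.rs = [iv[1] for iv in self.by_r]
        left = [iv for iv in intervals if iv[1] < self.center]
        right = [iv for iv in intervals if iv[0] > self.center]
        self.left = Node(left) if left else None
        self.right = Node(right) if right else None

def report(node, ql, qr, out):
    """Append every stored interval overlapping [ql, qr] to out."""
    if node is None:
        return
    if qr < node.center:
        # intervals here satisfy r >= center > qr, so they overlap iff l <= qr
        out.extend(node.by_l[:bisect.bisect_right(node.ls, qr)])
        report(node.left, ql, qr, out)
    elif ql > node.center:
        # intervals here satisfy l <= center < ql, so they overlap iff r >= ql
        out.extend(node.by_r[bisect.bisect_left(node.rs, ql):])
        report(node.right, ql, qr, out)
    else:
        out.extend(node.by_l)   # center lies in [ql, qr]: everything here overlaps
        report(node.left, ql, qr, out)
        report(node.right, ql, qr, out)
```

For example, `tree = Node([(1, 4), (2, 6), (7, 9)]); out = []; report(tree, 3, 8, out)` collects all three intervals.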

18.

Theoretical and Empirical Results

Main theorem: There is an $\tilde{O}(s)$-time algorithm that needs $\tilde{O}(n)$ space and picks each sample with the following probability:
• $\frac{1}{|q \cap X|}$ (non-weighted case)
• $\frac{w(x)}{\sum_{x' \in q \cap X} w(x')}$ (weighted case),
satisfying the "uniformly at random" sampling requirement.
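Just to pin down the target distribution (this is not the $\tilde{O}(s)$ algorithm, which works over the tree without materializing q ∩ X), a linear-time reference sampler:

```python
import random

def reference_sample(candidates, w=None):
    """Draw one element of candidates (= q ∩ X) with probability
    1/|q ∩ X| (w is None) or w(x) / sum of w over candidates."""
    if w is None:
        return random.choice(candidates)
    return random.choices(candidates, weights=[w(x) for x in candidates], k=1)[0]
```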

19.

Summary (Takeaway)

• Fairness when reducing data size
  - k-center clustering
  - Range search on interval data
• Source code is available at https://github.com/amgt-d1