Diversity Maximization in the Presence of Outliers@AAAI2023

132 Views

June 13, 24

#最大最小多様化 #外れ値 #近似アルゴリズム #高速化 #データ要約

スライド概要

Presenting our work "Diversity Maximization in the Presence of Outliers" at AAAI2023.

Daichi Amagata

@8828547526

スライド一覧

I'm a DB and AI researcher.

またはPlayer版

埋め込む »CMSなどでJSが使えない場合

ダウンロード

関連スライド

Fair Clustering & Search 2

Daichi Amagata 639

Independent Range Sampling on Interval Data@ICDE2024

Daichi Amagata 388

2022-DBS-amagata

Daichi Amagata 250

Random Sampling over Spatial Range Joins@ICDE2025

Daichi Amagata 148

Fair Clustering & Search

Daichi Amagata 108

Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach@SIGMOD2021

Daichi Amagata >100

各ページのテキスト

DIVERSITY MAXIMIZATION IN THE PRESENCE OF OUTLIERS Daichi Amagata Osaka University

BACKGROUND: MAKING A SUMMARY OF A DATASET This dataset (𝑋) is too large for my application (or resource)… I wanna reduce the size to 𝑘. 𝑋: a set of 𝑛 points 𝑆 𝑆: a set of 𝑘 (≪ 𝑛) points • Sampling: Quite easy & the best if the data distribution should be preserved (but may be skewed). • Submodular maximization: Users must specify an appropriate function (which may not be trivial). • Diversity maximization: Easy to specify & the summary can avoid skewness. #1

BACKGROUND: DIVERSITY MAXMIZATION ∗ 𝑆 = arg max𝑆⊆𝑋, 𝑆 =𝑘 𝒇 𝑺 ′ MAX-MIN: 𝑓 𝑆 = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ 𝑥,𝑥 ∈𝑆 It’s diverse! MAX-SUM: 𝑓 𝑆 = σ𝑥,𝑥 ′ ∈𝑆 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ It’s skewed… #2

BACKGROUND: DIVERSITY MAXMIZATION + OUTLIERS ∗ 𝑆 = arg max𝑆⊆𝑋, 𝑆 =𝑘 𝒇 𝑺 ′ MAX-MIN: 𝑓 𝑆 = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ 𝑥,𝑥 ∈𝑆 It’s diverse! #3

PRELIMINARY Problem Definition Input: 𝑋 = 𝑋𝑖𝑛 ∪ 𝑋𝑜𝑢𝑡 (a set of 𝑛 points in a metric space) and 𝑘 ≥ 2 (output size) ′ Output: 𝑆 ∗ = arg max𝑆⊆𝑋∖𝑿𝒐𝒖𝒕, 𝑆 =𝑘 min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ ▷ NP-hard 𝑥,𝑥 ∈𝑆 Assumptions Same as 𝑘-clustering Outliers are far from the other points. 1) 𝑋𝑜𝑢𝑡 = 𝑧 ′ 2) Define 𝑑𝑖𝑠𝑡 𝑥, 𝑋 = min 𝑑𝑖𝑠𝑡(𝑥, 𝑥 ). For each 𝑥 ∈ 𝑋𝑜𝑢𝑡 , we have ′ 𝑥 ∈𝑋 for every 𝑥 ′ ∈ 𝑋𝑖𝑛 • 𝑑𝑖𝑠𝑡 𝑥, 𝑋 ∖ 𝑥 > 𝑑𝑖𝑠𝑡 𝑥 ′ , 𝑋 ∖ 𝑥 ′ • 𝑑𝑖𝑠𝑡 𝑥, 𝑋 ∖ 𝑥 ′ > 𝛼𝑙 ∗ , where 𝛼 ≥ 1 and 𝑙 ∗ = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ ∗ 𝑥,𝑥 ∈𝑆 #4

SOTA ALGO. FOR MAX-MIN DIV. W/O OUTLIERS IS NOT EFFECTIVE GMM [1]: A state-of-the-art 𝑂(𝑛𝑘) time ½-approximation algorithm if 𝑋𝑜𝑢𝑡 = ∅ 1) Add a random point in 𝑋 into 𝑆 ▷ 𝑂 1 time 2 to 𝑘) Add 𝑥 ∗ = arg max𝑥∈𝑋∖𝑆 𝑑𝑖𝑠𝑡 𝑥, 𝑆 into 𝑆 ▷ 𝑂 𝑛(𝑘 − 1) time 4 It’s diverse! 7 3 It’s dirty… 10 1 8 6 9 2 5 [1] Ravi, S. S.; Rosenkrantz, D. J.; and Tayi, G. K. 1994. Heuristic and Special Case Algorithms for Dispersion Problems. Operations Research, 42(2): 299–310. #5

CONTRIBUTION TO THE PROBLEM OF MAX-MIN DIV. w/ OUTLIERS Challenge Does there exist an algorithm that can return 𝑆 containing no outliers in time proportional to 𝑛, 𝑘, and 𝑧? Findings We prove that there exists an • • 𝑂 𝑘 + 𝑧 𝑛 time ½-approximation algorithm that returns no outliers 1 𝑂 𝑘𝑧 time -approximation algorithm that returns no outliers with a constant probability 6 1+𝜖 Experiments We demonstrate that • Existing Max-Min diversification algorithms’ solutions contain most outliers. • Our algorithms are much (e.g., four orders of magnitude) faster than a baseline algorithm. #6

BASELINE ALGORITHM: 𝑶(𝒏𝟐 ) TIME, ½ -APPROX., & NO OUTLIERS Operations 1) Compute the NN (nearest neighbor) of each 𝑥 ∈ 𝑋. ▷ 𝑶 𝒏𝟐 time 2) Remove 𝑧 points with the largest NN distance. ▷ 𝑂 𝑧 time 3) Run GMM on the remaining points. ▷ 𝑂 𝑛 − 𝑧 𝑘 time Assumptions The 𝑧 points with the largest distance to the nearest neighbor are outliers. 1) 𝑋𝑜𝑢𝑡 = 𝑧 ′ 2) Define 𝑑𝑖𝑠𝑡 𝑥, 𝑋 = min 𝑑𝑖𝑠𝑡(𝑥, 𝑥 ). For each 𝑥 ∈ 𝑋𝑜𝑢𝑡 , we have ′ 𝑥 ∈𝑋 for every 𝑥 ′ ∈ 𝑋𝑖𝑛 • 𝑑𝑖𝑠𝑡 𝑥, 𝑋 ∖ 𝑥 > 𝑑𝑖𝑠𝑡 𝑥 ′ , 𝑋 ∖ 𝑥 ′ • 𝑑𝑖𝑠𝑡 𝑥, 𝑋 ∖ 𝑥 > 𝛼𝑙 ∗ , where 𝛼 ≥ 1 and 𝑙 ∗ = 𝑑𝑖𝑣 𝑆 ∗ #7

OUTLIER-AWARE GREEDY ALGORITHM Lemma 1 𝑆 ′ ← GMM 𝑋, 𝑘 + 𝑧 contains the 𝑧 outliers. Operations 1) 𝑆 ′ ← GMM 𝑋, 𝑘 + 𝑧 ▷ 𝑂 𝑛(𝑘 + 𝑧) time 2) Compute the NN of each 𝑥 ∈ 𝑆′ ▷ 𝑂 𝑛(𝑘 + 𝑧) time 3) 𝑍 ← 𝑧 points with the largest NN dist. 4) 𝑆 ← GMM(𝑋 ∖ 𝑍, 𝑘) ▷ 𝑂 𝑧 time ▷ 𝑂 𝑛 − 𝑧 𝑘 time Result of GMM 𝑋, 𝑘 + 𝑧 Theorem 1 GREEDY returns a ½-approx. solution with no outliers in 𝑂(𝑛(𝑘 + 𝑧)) time. #8

10.

STREAMING ALGORITHM መ 𝑙/2 𝑂( 𝑋 ′ 𝑘) time 𝑂(1) iterations because 𝐿 = 𝑂 1 Lemma 2 STREAMING(𝑋 , 𝑘) runs in 𝑂( 𝑋 𝑘) time and returns a ′ ′ no outliers with probability 1 − * 𝑙∗ = 𝑓 𝑆 ∗ = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ ′ ∗ 𝑥,𝑥 ∈𝑆 𝑧 𝑋′ if 𝛼 ≥ 2. 1 2 1+𝜖 -approx. solution with #9

11.

CORESET FOR MAX-MIN DIVERSIFICATION WITH OUTLIERS Definition of Coreset 𝐶 ⊆ 𝑋 is a 𝛽-coreset, if, for 𝑆 ⊆ 𝐶, we have 𝑑𝑖𝑣 𝑆 ≥ 𝑑𝑖𝑣 𝑆 ∗ 𝛽 . Lemma 3 𝐶 ← GMM 𝑋, 𝑘 + 𝑧 is a 3-coreset (it yields the anti-cover property even in the presence of outliers). Theorem 2 (main result) Built in an offline phase 1 STREAMING(𝐶, 𝑘), where 𝐶 = 𝑂 𝑧 ≥ 𝑘 + 𝑧, runs in 𝑂(𝑘𝑧) time and returns a -approx. 6 1+𝜖 solution with no outliers with a constant probability if 𝛼 ≥ 2 (1 − 𝐶𝑧 = 𝑂 1 ⟺ 𝐶 = 1−𝑂𝑧 1 ). We can make a summary 𝑺 of 𝑿 with online time independent of 𝒏. # 10

12.

EXPERIMENT: COMPARISON WITH COMPETITORS (𝒌 = 𝟏𝟎𝟎, 𝒛 = 𝟐𝟎𝟎) Table 1: Average number of outliers in 𝑆 𝑆 of the existing algorithms (GMM & PODS19) is dominated by outliers… Table 2: Average 𝑑𝑖𝑣 𝑆 = min 𝑑𝑖𝑠𝑡 𝑥, 𝑥 ′ and running time [sec] (no outliers are returned) ′ 𝑥,𝑥 ∈𝑆 Our algorithms are faster than BASELINE. Especially, CORESET is super fast. Table 3: Average time to identify 𝑧 outliers [sec] GREEDY identifies the outliers much faster than BASELINE. * FCT: 10-dim & 𝑛 = 580,812, Household: 7-dim & 𝑛 = 2,049,280, KDD99: 16-dim & 𝑛 = 311,029, Mirai: 115-dim & 𝑛 = 764,137 # 11

13.

EXPERIMENT: IMPACTS OF 𝒌 & 𝒛 # 12

14.

CONCLUSION Problem: First study on Max-Min diversification with outliers Technical Contribution: We proposed • GREEDY: 𝑂 𝑘 + 𝑧 𝑛 time ½-approx. algorithm that returns no outliers • CORESET: 𝑂 𝑘𝑧 time 1/6 1 + 𝜖 -approx. algorithm that returns no outliers with a constant probability Experiments: We demonstrated that • Existing Max-Min diversification algorithms’ solutions contain most outliers. • Our algorithms are much (e.g., four orders of magnitude) faster than a baseline algorithm. Source code available at https://github.com/amgt-d1/Max-Min-w-Outliers # 13