2022-DBS-amagata

データベース技術をいろんな分野へ！ー推薦システム編ー大阪大学大学院情報科学研究科助教天方大地

自己紹介 ◼ ◼ ◼ 名前：天方大地（あまがただいち）略歴： 2015年大阪大学大学院情報科学研究科博士後期課程修了同年大阪大学大学院情報科学研究科助教 2019年 JSTさきがけ研究員兼任研究テーマ： Fast algorithms for databases, data mining, and machine learning techniques  データベース系：EDBT’16, TKDE’17,18,22, ICDE’19, SIGMOD’21, VLDB Journal’22  位置情報サービス系：SIGSPATAIL’20,21,22  推薦システム系：RecSys’21,22,  人工知能系：AAAI’23 #1

3.

講演の前に… ◼ プレゼンターはデータベース分野を主軸にしています．  ◼ 推薦システムの研究者ではないです（し，素人です）．内容は，「検索の定式化」＋「高速アルゴリズム」です．  推薦精度向上のための仕組み等に関する研究ではありません（がTwitterで取り上げてくれた方も）．  RecSysで似た研究（高速アルゴリズム系）は，おそらく 2014年の Microsoft Research の論文以来 #2

4.

RecSysってどんな会議？ ◼ その名の通り，推薦システム全般を扱う国際会議  ◼ モデリング（学習），インタラクション，公正性・プライバシ，ドメイン適用，スケーラビリティ，etc. 特徴  企業からの発表（Research Track）が多い！例）NVIDIA，Netflix，Adobe，Alibaba，Visa，Amazon，Google，Microsoft，Tencent，etc.  スポンサーがすごい！（SIGMODと同じような注目度）  シングルセッション  採択率きつい（2021: 18.4%, 2022: 16.9%）  参加者多い（2022: >1100名）← SIGMOD2022より多い  （DBより）デプロイのイメージがしやすい研究が多い #3

5.

REVERSE MAXIMUM INNER PRODUCT SEARCH: HOW TO EFFICIENTLY FIND USERS WHO WOULD LIKE TO BUY MY ITEM? Daichi Amagata1,2 and Takahiro Hara1 1. Osaka University 2. JST PRESTO

6.

BACK GROUND: MATRIX FACTORIZATION Rating Matrix: 𝑝1 User & Item Matrices <- we have these. Items 𝑝2 𝑝3 𝑝4 𝑝5 … 𝑑 𝑝𝑚 𝑚 𝑢1 𝑢2 Users 𝑢3 𝑛 Users x Items 𝑑 … 𝑢𝑛 Motivation: [1] empirically shows that MF-based recommender systems are better than Neural CF-based ones. [1] Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson, Neural Collaborative Filtering vs. Matrix Factorization Revisited, In RecSys, 2020. #5

7.

BACK GROUND: MAXIMUM INNER PRODUCT SEARCH Problem 1 (MIPS): Given a user vector 𝐮 and an item matrix 𝐏, MIPS finds 𝐩∗ = arg max𝐩∈𝐏 𝐮 ⋅ 𝐩 𝑢1 𝑢2 𝑢3 𝑝1 𝑝2 𝑝3 𝑝4 𝑝5 4.1 5.0 2.5 1.8 3.0 … 𝑝𝑚 0.2 𝑚 𝑑 𝐮𝟏 x P (item vectors) 𝑑 … 𝑢𝑛 𝐩∗ = 𝐩𝟐 Note: This corresponds to ✓ making a recommendation list for a given user. <- a “user-driven” service #6

8.

BACK GROUND: HOW TO FIND USERS WHO MAY BUY A GIVEN ITEM? Question: For a given item 𝐩, which user vectors 𝐮 do satisfy 𝐩 = arg max𝐩′ ∈𝐏 𝐮 ⋅ 𝐩′ ? 𝑝1 𝑝2 𝑝3 𝑝4 𝑝5 𝑢1 4.1 5.0 2.5 1.8 3.0 𝑢2 4.9 4.3 2.9 4.5 2.1 𝑢3 2.3 4.6 4.1 3.1 1.5 𝑢4 3.3 4.1 3.4 3.6 1.9 𝑢5 2.1 3.9 2.6 4.3 3.1 𝑑 𝑚 𝑛 Users x Items 𝑑 Note: This corresponds to ✓ making advertisements of a given item to users who prefer it. <- an “item-driven” service • Easy to know targets for new items, restocked items, etc. ✓ knowing market size (about 𝐩). #7

9.

[beta]

REVERSE MAXIMUM INNER PRODUCT SEARCH
DEFINITION 1 (k-MIPS problem):
• Input: a set of vectors P, a query vector q, and k,
• Output: k vectors in P that have the highest inner products with q.

DEFINITION 2 (Reverse k-MIPS problem):
• Input: a query (item) vector 𝐪, k, and two sets of vectors 𝐐 (set of user vectors) and 𝐏
(set of item vectors)
• Output: all vectors 𝐮 ∈ 𝐐 s.t. q is included in the k-MIPS result of 𝐮 among 𝐏 ∪ {𝐪}.

How to solve the reverse k-MIPS problem?
• For each user vector in 𝐐, run state-of-the-art k-MIPS algorithm [2,3] on 𝐏 ∪ {𝐪}?
• The above approach incurs 𝑂 𝑛𝑚𝑑 time.
For large scale recommender systems, 𝑛 (#users) and 𝑚 (#items)
are both large, so this is impractical!
[2] Hui Li, Tsz Nam Chan, Man Lung Yiu, and Nikos Mamoulis, FEXIPRO: fast and exact inner product retrieval in recommender systems, In SIGMOD, 2017.
[3] Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk, LEMP: Fast retrieval of large entries in a matrix product, In SIGMOD, 2015.

#8

10.

CONTRIBUTION 1. We propose the reverse k-MIPS problem. 2. We propose Simpfer, a simple, fast and exact algorithm for reverse k-MIPS. • Simpfer is easy to deploy, if you use a MF-based recommender system. • Source code is available at https://github.com/amgt-d1/Simpfer. 3. We analyze Simpfer theoretically, and its time complexity is 𝑶 𝟏 − 𝜶 𝒏𝒎′ 𝒅 . • 𝛼: pruning ratio of user vectors. • 𝑚′: the average number of accessed item vectors (empirically 𝑚′ = 𝑂 𝑘 ). 4. We analyze Simpfer empirically on four real datasets. • Simpfer is two orders of magnitude faster than baseline algorithms. • For a dataset with 2 million users, Simpfer needs 0.5[sec], while the baselines need a half hour! #9

https://github.com/amgt-d1/Simpfer

11.

[beta]

HOW TO SPEED-UP REVERSE k-MIPS PROCESSING?
Straightforward solution: for each 𝐮 ∈ 𝐐, run k-MIPS on 𝐏 ∪ {𝐪}
𝑚

𝑛

𝐐:

x

𝐏: items

Strong points:
𝐪

users

•

The result is correct.

•

Some item vectors can be skipped by using [2,3].

Weak points:
These white vectors include 𝐪
as k-MIPS result.

•

All user vectors must be investigated.

•

Most user vectors are not included in the result.

Our Idea:
1.

User vectors that are not included in the reverse k-MIPS result should be ignored.

2.

We don’t need the k-MIPS result (i.e., vectors), and just Yes/No answer for the k-MIPS result is sufficient.

-> How to implement these ideas?
[2] Hui Li, Tsz Nam Chan, Man Lung Yiu, and Nikos Mamoulis, FEXIPRO: fast and exact inner product retrieval in recommender systems, In SIGMOD, 2017.
[3] Christina Teflioudi, Rainer Gemulla, and Olga Mykytiuk. LEMP: Fast retrieval of large entries in a matrix product, In SIGMOD, 2015.

# 10

12.

PRE-PROCESSING Lines 1-6: General data structure 𝐐 𝐮𝟏 𝐮𝟐 𝐮𝟑 𝐮𝟒 𝐮𝟓 𝐮𝟔 𝐮𝟕 𝐮𝟖 … 𝐮𝒏 𝐏 𝐩𝟏 𝐩𝟐 𝐩𝟑 𝐩𝟒 𝐩𝟓 𝐩𝟔 𝐩𝟕 𝐩𝟖 … 𝐩𝒎 norm: 𝐯 𝐏 ′ : the first 𝑂 𝑘𝑚𝑎𝑥 vectors in 𝐏 Lines 7-10: Lower-bound arrays (𝑘𝑚𝑎𝑥 is the maximum 𝑘) 𝐿1 4.5 4.1 3.9 … 3.2 … … … … … … 𝐿𝑛 4.8 4.7 4.2 … 3.8 ← 𝑘𝑚𝑎𝑥 -MIP in 𝐮𝟏 × 𝐏 ′ (≤ 𝐩∗ ) ← 𝑘𝑚𝑎𝑥 -MIP in 𝐮𝒏 × 𝐏 ′ Lines 11-18: Blocks (the block size is 𝑂(log 𝑛)) 𝐮𝟏 Only 𝑂 𝑛 𝑑 + log 𝑛 𝐮𝟐 𝐮𝟑 𝐮𝟒 𝐮𝟓 𝐮𝟔 𝐁𝟏 time & 𝑂 𝑛 space. 𝐿 𝐁𝟏 4.3 4.1 𝐮𝟕 𝐮𝟖 … 𝐮𝒏 𝐁𝟐 3.5 … 2.9 𝑗 ← 𝐿𝑗 𝐁 = min 𝐿𝑖 𝐮𝐢 ∈𝐁 … … … … … # 11

13.

UPPER- & LOWER-BOUNDING Assume that we have 𝐮𝒊 ⋅ 𝐪. LEMMA 1: If 𝐮𝒊 ⋅ 𝐪 ≤ 𝐿𝑘𝑖 , 𝐪 is not in the k-MIPS result of 𝐮𝒊 . <- 𝑂 1 time for saying “No”. LEMMA 2: If 𝐮𝒊 ⋅ 𝐪 ≥ 𝐮𝒊 𝐩𝒌 , 𝐪 is in the k-MIPS result of 𝐮𝒊 . <- 𝑂 1 time for saying “Yes”. LEMMA 3: Let 𝐮𝒊 be the first vector of 𝐐(𝐵). If 𝐮𝒊 𝐪 ≤ 𝐿𝑘 (𝐵), ∀𝐮 ∈ 𝐐(𝐵), <- 𝑂 1 time for saying “No”, 𝐪 is not in the k-MIPS result of 𝐮. ∀𝐮 ∈ 𝐐(𝐵) in a batch. If 𝐮𝒊 does not have Lemmas 1-3, consider a sequential scan of 𝑃. Assume that the current top-k threshold is 𝜏. 𝐩𝟏 𝐩𝟐 𝐩𝟑 𝐩𝟒 𝐩𝟓 𝐩𝟔 𝐩𝟕 𝐩𝟖 … 𝐩𝒎 COROLLARY 1: Assume that 𝐪 is in an intermediate k-MIPS result of 𝐮𝑖 , and we now <- Early stop the scan COROLLARY 2: If 𝐮𝒊 ⋅ 𝐪 < 𝜏, 𝐪 is not in the k-MIPS result of 𝐮𝒊 . <- Early stop the scan evaluate 𝐩𝑗 . If 𝐮𝒊 ⋅ 𝐪 (≥ 𝜏) ≥ 𝐮𝒊 𝐩𝒋 , 𝐪 is in the k-MIPS result of 𝐮𝒊 . # 12

14.

SIMPFER: SIMPLE, FAST, AND EXACT FOR REVERSE k-MIPS <- Early stop (Corollary 1) <- Batch pruning (Lemma 3) <- Lower-bounding (Lemma 1) <- Upper-bounding (Lemma 2) <- Early stop (Corollary 2) Only 𝑂 𝑚′ 𝑑 time, where 𝑚′ ≤ 𝑚 𝑶 𝟏 − 𝜶 𝒏𝒎′ 𝒅 time, where 𝛼 is pruning rate of 𝐐 (almost 𝑘 in practice). # 13

15.

EXPERIMENTAL RESULT 1: IMPACT OF 𝒌 MovieLens (𝑛 = 138,493, 𝑚 = 26,744), Netflix (𝑛 = 480,189, 𝑚 = 17,770), Yahoo! (𝑛 = 2,088,620, 𝑚 = 200,941), Amazon (𝑛 = 1,948,882, 𝑚 = 98,211) # 14

16.

EXPERIMENTAL RESULT 2: IMPACT OF 𝒏 (#USERS) MovieLens (𝑛 = 138,493, 𝑚 = 26,744), Netflix (𝑛 = 480,189, 𝑚 = 17,770), Yahoo! (𝑛 = 2,088,620, 𝑚 = 200,941), Amazon (𝑛 = 1,948,882, 𝑚 = 98,211) # 15

17.

EXPERIMENTAL RESULT 3: IMPACT OF 𝒎 (#ITEMS) MovieLens (𝑛 = 138,493, 𝑚 = 26,744), Netflix (𝑛 = 480,189, 𝑚 = 17,770), Yahoo! (𝑛 = 2,088,620, 𝑚 = 200,941), Amazon (𝑛 = 1,948,882, 𝑚 = 98,211) # 16

18.

Summary 1. We proposed the reverse k-MIPS problem. 2. We proposed Simpfer for reverse k-MIPS. • Source code is available at https://github.com/amgt-d1/Simpfer. 3. We analyzed Simpfer theoretically, and its time complexity is 𝑂 1 − 𝛼 𝑛𝑚′ 𝑑 . 4. We analyzed the empirical efficiency of Simpfer on four real datasets. # 17

https://github.com/amgt-d1/Simpfer

19.

SOLVING DIVERSITY-AWARE MAXIMUM INNER PRODUCT SEARCH EFFICIENTLY AND EFFECTIVELY Kohei Hirata1, Daichi Amagata1, Sumio Fujita2, Takahiro Hara1 1. Osaka University 2. Yahoo Japan Corporation

20.

𝒌-Maximum Inner Product Search 𝑘-MIPS: retrieves 𝑘 vectors having the highest inner product with a query vector 𝑚 𝑑 𝐪𝟏 x P (item vectors) 2-MIPS result: {𝐩𝟐 , 𝐩𝟒 } = recommendation list 𝑑 𝐩𝟒 … 𝐩𝒏 1.0 4.9 3.3 4.9 … 2.3 𝐩𝟏 𝐪𝟏 𝐩𝟐 𝐩𝟑 Drawback of simply employing 𝑘-MIPS: ◼ Lists tend to contain similar items (low opportunity of getting “new notices”). ◼ They consider only preference, and diversity is not considered at all. # 19

21.

Diversity-aware 𝒌-MIPS: Problem Definition Input: a query vector 𝐪, a set 𝐏 of item vectors, an output size 𝑘 ≥ 1, and a diversity degree 𝜆 ∈ [0,1] 𝝀 𝒌 Output: 𝐒∗ = 𝐚𝐫𝐠 𝐦𝐚𝐱 𝒇 𝐒 𝒔. 𝒕. 𝒇 𝐒 = σ𝐩∈𝐒 𝐩 ⋅ 𝐪 + 𝒄 𝟏 − 𝝀 𝐦𝐢𝒏𝐩,𝐩′ ∈𝐒 𝒅𝒊𝒔𝒕 𝐩, 𝐩′ 𝐒⊆𝐏, 𝐒 =𝒌  Inner products (𝐩 ⋅ 𝐪) are computed from the vectors generated by Matrix Factorization.  Distances (𝑑𝑖𝑠𝑡) are computed from the vectors generated by Item2Vec [1].  𝑐: a scaling constant Semantic of the problem: i. 𝐒∗ contains items having high inner products with 𝐪. ii. The items in 𝐒∗ are dissimilar to each other (i.e., they are diverse). iii. Users can control the degree of diversity through 𝜆. [1] Item2vec neural item embedding for collaborative filtering In MLSP (ICASSP workshop) 2016. # 20

22.

Case Study (by MovieLens dataset) 𝜆 = 1 (no diversification) Order 1 Title Harry Potter and the Deathly Hallows Part 2 (Fantasy) 𝜆 = 0.5 (50% diversification) IP 3.50 2 Harry Potter and the Deathly Hallows Part 1 3.48 3 Harry Potter and the Half-Blood Prince 3.46 4 Harry Potter and the Order of the Phoenix 3.45 5 Harry Potter and the Goblet of Fire 3.40 Title Harry Potter and the Deathly Hallows Part 2 (Fantasy) Darwyn Cooke‘s Batman 75th Anniversary (Animation) A Midsummer Night‘s Dream (Comedy) Jimmy Carr: Live (Comedy) Went the Day Well? (Thriller) IP 3.50 3.37 3.31 3.29 3.21 # 21

23.

Greedy Algorithm: The diversity-aware 𝑘-MIPS problem is NP-hard Procedures: 1. 2. Init of 𝐒: add the item having the highest inner product with 𝐪 into 𝐒. Update of S: compute 𝑓 𝐩, 𝐒 for each 𝐩 ∈ 𝐏 ∖ 𝐒, then 𝐩∗ = argmax 𝑓 𝐩, 𝐒 is added into 𝐒. 𝑓 𝐩, 𝐒 = 𝜆 𝐩 ⋅ 𝐪 + 𝑐 1 − 𝜆 3. min 𝐩𝑎 ,𝐩𝑏 ∈𝐒∪ 𝐩 𝑑𝑖𝑠𝑡 𝐩𝑎 , 𝐩𝑏 𝐩∈𝐏∖𝐒 Repeat step-2 until 𝐒 = 𝑘. Step 1 Scan whole 𝐏 𝐏 𝐩𝟏 𝐩𝟐 𝐩𝟑 𝐒 (result set) … 𝐩𝒋 … 𝐩𝒏 𝐩𝟏 (𝐩𝑖 : the item added in the 𝑖-th iteration) # 22

24.

Greedy Algorithm: The diversity-aware 𝑘-MIPS problem is NP-hard Procedures: 1. 2. Init of 𝐒: add the item having the highest inner product with 𝐪 into 𝐒. Update of S: compute 𝑓 𝐩, 𝐒 for each 𝐩 ∈ 𝐏 ∖ 𝐒, then 𝐩∗ = argmax 𝑓 𝐩, 𝐒 is added into 𝐒. 𝑓 𝐩, 𝐒 = 𝜆 𝐩 ⋅ 𝐪 + 𝑐 1 − 𝜆 3. min 𝐩𝑎 ,𝐩𝑏 ∈𝐒∪ 𝐩 𝐩∈𝐏∖𝐒 𝑑𝑖𝑠𝑡 𝐩𝑎 , 𝐩𝑏 Repeat step-2 until 𝐒 = 𝑘. Steps 2 & 3 Scan whole 𝐏 𝐏 𝐒 𝐩𝟏 𝐩𝟐 𝐩𝟑 𝐩𝟏 … 𝐩𝒋 … 𝐩𝒏 𝐒 (result set) 𝐩∗ 𝑓 𝐩, 𝐒 comp. 𝐩𝟏 𝐩𝟐 (𝐩𝑖 : the item added in the 𝑖-th iteration) (involves 𝑑𝑖𝑠𝑡 comp.) # 23

25.

Greedy Algorithm: Drawback Time complexity: 𝑂(𝑛𝑘) Bottleneck: computing 𝑓 𝐩, 𝐒 for each * 𝐩 ∈ 𝐏 ∖ 𝐒 *In each iteration, we wanna have a single vector 𝐩∗ , but the algorithm accesses 𝑂(𝑛) vectors… Steps 2 & 3 Scan whole 𝐏 𝐏 𝐒 𝐩𝟏 𝐩𝟐 𝐩𝟑 𝐩𝟏 … 𝐩𝒋 … 𝐩𝒏 𝐒 (result set) 𝐩∗ 𝑓 𝐩, 𝐒 comp. 𝐩𝟏 𝐩𝟐 (𝐩𝑖 : the item added in the 𝑖-th iteration) (involves 𝑑𝑖𝑠𝑡 comp.) # 24

26.

IP-Greedy (our algorithm): Idea Identifying 𝐩∗ without computing 𝒇 𝐩, 𝐒 for each 𝐩 ∈ 𝐏 ∖ 𝐒 Early stop: We can early stop the scan when we know that 𝐩∗ does not exist among the unseen vectors. stop Scan 𝐏 𝐩𝟏 𝐩𝟐 𝐩𝟑 … 𝐩𝒋 𝐩𝒏 𝐩∗ = 𝐩∗𝑖𝑛𝑡 Item level skip: We can skip computing 𝑓 𝐩, 𝐒 if we know 𝐩 ≠ 𝐩∗ . skip Scan 𝐏 𝐩𝟏 𝐩𝟐 𝐩𝟑 … 𝐩𝒋 … 𝐩𝒏 # 25

27.

IP-Greedy: Theory Assumption: 𝐩𝑗 ∈ 𝐏 is sorted by descending order of 𝛾𝑗 = min( 𝐩𝑗 𝐼 𝐩𝑗 ⋅ 𝐪 𝐪 ) 𝐼 𝐩𝑗 , 𝐪 : norm of vector obtained by MF (which can be computed offline) Theorem 1 (early stop): If 𝐩𝒋 ∈ 𝐏 ∖ 𝐒 has 𝐦𝐢𝐧 𝒅𝒊𝒔𝒕 𝐩𝒂 , 𝐩𝒃 ≤ 𝒇 𝐩∗𝒊𝒏𝒕 , 𝐒 − 𝝀 𝜸𝒋 𝐪 𝒄 𝟏−𝝀 𝐩𝒂 ,𝐩𝒃 ∈𝐒 (𝐩∗𝑖𝑛𝑡 : intermediate 𝐩∗ ) stop Scan 𝐏 𝐩𝟏 𝐩𝟐 𝐩𝟑 , then 𝐩𝒋 , … , 𝐩𝒏 ≠ 𝐩∗ … 𝐩𝒋 𝐩∗ = 𝐩∗𝑖𝑛𝑡 𝐩𝒏 larger 𝛾 Theorem 2 (item level skip): If 𝐩𝒋 ∈ 𝐏 ∖ 𝐒 has 𝐩𝐣 𝑴 + 𝐦𝐢𝐧 𝐩 𝐩∈𝐒 𝑴 ≤ 𝒇 𝐩∗𝒊𝒏𝒕 , 𝑺 − 𝝀 𝜸𝒋 𝒒 𝒄 𝟏−𝝀 , then 𝐩𝒋 ≠ 𝐩∗ ( 𝐩 𝑴 : norm of vector obtained by Item2Vec) skip Scan 𝐏 𝐩𝟏 𝐩𝟐 𝐩𝟑 … 𝐩𝒋 … 𝐩𝒏 # 26

28.

IP-Greedy: Algorithm Step 1 (Init of 𝐒: add the item having the highest inner product with 𝐪 into 𝐒)  Reduce #inner product computations by using 𝐩 ⋅ 𝐪 ≤ 𝐩  Update 𝛾 (sort key) 𝐪 Scan 𝐏 𝐩𝟏 𝐩𝟐 𝐩𝟑 𝐒 (result set) … 𝐩𝒋 … 𝐩𝟏 𝐩𝒏 (𝐩𝑖 : the item added in the 𝑖-th iteration) Steps 2 & 3 (Update of S: compute 𝑓 𝐩, 𝐒 for each 𝐩 ∈ 𝐏 ∖ 𝐒, then 𝐩∗ = argmax 𝑓 𝐩, 𝐒 is added into 𝐒) 𝐩∈𝐏∖𝐒  Partial sort by 𝛾  Employ theorem 1 and 2 to efficiently compute 𝐩∗ Scan 𝐏 𝐒 𝐩𝟏 𝐩𝟐 𝐩𝟑 𝐩𝟏 𝐩 𝟐 𝐩𝟑 … … 𝐩𝒋 … 𝐩𝒏 𝐒 (result set) 𝐩∗ 𝑓 𝐩, 𝐒 comp. 𝐩𝟏 𝐩𝟐 𝐩𝟑 … 𝐩𝒌 # 27

29.

Experimental result: Impact of 𝝀 (𝒌 = 𝟏𝟎) (A smaller 𝜆 yields higher div.) Amazon-M (#items = 200,941) 0.25 𝜆 0.5 0.75 Time [msec] MMR Min-Dist Time MMR Min-Dist Time MMR Min-Dist FEXIPRO 1.47 1.25 0.32 1.47 2.44 0.32 1.47 3.63 0.32 ScaNN 0.08 1.23 0.40 0.08 2.39 0.40 0.08 3.55 0.40 Greedy 940.46 1.69 4.58 933.98 2.54 2.13 938.36 3.64 0.50 IP-Greedy 147.82 1.69 4.58 83.26 2.54 2.13 59.80 3.64 0.50 MovieLens (#items = 59,047) 0.25 𝜆 0.5 0.75 Time MMR Min-Dist Time MMR Min-Dist Time MMR Min-Dist FEXIPRO 0.40 1.47 3.39 0.40 2.61 3.39 0.40 3.76 3.39 ScaNN 0.07 1.47 3.56 0.07 2.60 3.56 0.07 3.73 3.56 Greedy 253.36 1.59 5.94 254.42 2.63 3.86 253.51 3.76 3.58 IP-Greedy 30.90 1.59 5.94 14.48 2.63 3.86 14.30 3.76 3.58 # 28

30.

Experimental result: Impact of 𝝀 (𝒌 = 𝟏𝟎) (A smaller 𝜆 yields higher div.) Amazon-M (#items = 200,941) 0.25 𝜆 0.5 0.75 Time [msec] MMR Min-Dist Time MMR Min-Dist Time MMR Min-Dist FEXIPRO 1.47 1.25 0.32 1.47 2.44 0.32 1.47 3.63 0.32 ScaNN 0.08 1.23 0.40 0.08 2.39 0.40 0.08 3.55 0.40 Greedy 940.46 1.69 4.58 933.98 2.54 2.13 938.36 3.64 0.50 IP-Greedy 147.82 1.69 4.58 83.26 2.54 2.13 59.80 3.64 0.50 MovieLens (#items = 59,047) 0.25 𝜆 0.5 0.75 Time MMR Min-Dist Time MMR Min-Dist Time MMR Min-Dist FEXIPRO 0.40 1.47 3.39 0.40 2.61 3.39 0.40 3.76 3.39 ScaNN 0.07 1.47 3.56 0.07 2.60 3.56 0.07 3.73 3.56 Greedy 253.36 1.59 5.94 254.42 2.63 3.86 253.51 3.76 3.58 IP-Greedy 30.90 1.59 5.94 14.48 2.63 3.86 14.30 3.76 3.58 # 29

31.

Summary 1. Formalization of the diversity-aware 𝑘-MIPS problem 2. Proposal of IP-Greedy that returns the same solution as the standard greedy algorithm 3. Extensive experiments on real datasets (qualitative and quantitative evaluations)  IP-Greedy is efficient and guarantees a more diversified result compared with 𝑘-MIPS.  Sorce code is available at https://github.com/peitaw22/IP-Greedy # 30

https://github.com/peitaw22/IP-Greedy

Daichi Amagata

関連スライド

Fair Clustering & Search 2

Independent Range Sampling on Interval Data@ICDE2024

Random Sampling over Spatial Range Joins@ICDE2025

Diversity Maximization in the Presence of Outliers@AAAI2023

Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach@SIGMOD2021

Fair Clustering & Search

各ページのテキスト