Independent Range Sampling on Interval Data@ICDE2024

414 Views

June 13, 24

#区間データ #独立範囲サンプリング #範囲検索 #アルゴリズム #データ構造

スライド概要

Presenting our work "Independent Range Sampling on Interval Data" at ICDE2024

Daichi Amagata

@8828547526

スライド一覧

I'm a DB and AI researcher.

またはPlayer版

埋め込む »CMSなどでJSが使えない場合

ダウンロード

関連スライド

Fair Clustering & Search 2

Daichi Amagata 724

2022-DBS-amagata

Daichi Amagata 353

Random Sampling over Spatial Range Joins@ICDE2025

Daichi Amagata 326

Diversity Maximization in the Presence of Outliers@AAAI2023

Daichi Amagata 147

Fast and Exact Outlier Detection in Metric Spaces: A Proximity Graph-based Approach@SIGMOD2021

Daichi Amagata 112

Fair Clustering & Search

Daichi Amagata 108

各ページのテキスト

Independent Range Sampling on Interval Data Daichi Amagata Osaka University

Background: Range Search on Interval Data No. 1 • An interval 𝑥 = [𝑥. 𝑙, 𝑥. 𝑟] • A set of 𝑛 intervals 𝑋 = {𝑥1 , … , 𝑥𝑛 } • A query interval 𝒒 = [𝒒. 𝒍, 𝒒. 𝒓] finds 𝒙|𝒙 ∈ 𝑿, 𝒙 ∩ 𝒒 = 𝒒 ∩ 𝑿, “all” intervals ∈ 𝑿 that overlap 𝒒. Ex.1) Vehicle management systems: Show vehicles that were active between 17:00 and 22:00 a week ago. Ex.2) Book management systems: Collect books that were borrowed in the last month. Ex.3) Historical cryptocurrency databases: Show when the price of Bitcoin (BTC) falls in [30,000, 40,000] dollars.

Background: Issues Due to “Too Large Result Set” No. 2 • An interval 𝑥 = [𝑥. 𝑙, 𝑥. 𝑟] • A set of 𝑛 intervals 𝑋 = {𝑥1 , … , 𝑥𝑛 } • A query interval 𝒒 = [𝒒. 𝒍, 𝒒. 𝒓] finds 𝒙|𝒙 ∈ 𝑿, 𝒙 ∩ 𝒒 = 𝒒 ∩ 𝑿, “all” intervals ∈ 𝑿 that overlap 𝒒. Ex.1) Vehicle management systems: Show vehicles that were active between 17:00 and 22:00 a week ago. Database system view: A range search must incur 𝛀 to obtain its result set. 𝒙|𝒙 ∈ 𝑿, 𝒙 ∩ 𝒒 = 𝑶(𝒏) time Still 10,000,000 intervals! query interval 𝑞 Database 𝑋 of a billion intervals 𝑥 | 𝑥 ∈ 𝑋, 𝑥 ∩ 𝑞 Issue 1: not easy to analyze/visualize Issue 2: slow response

Background: Range Sampling No. 3 • An interval 𝑥 = [𝑥. 𝑙, 𝑥. 𝑟] • A set of 𝑛 intervals 𝑋 = {𝑥1 , … , 𝑥𝑛 } • A query interval 𝒒 = [𝒒. 𝒍, 𝒒. 𝒓] randomly samples some intervals from 𝒒 ∩ 𝑿. Ex.1) Vehicle management systems: Show vehicles that were active between 17:00 and 22:00 a week ago. Ex.1) Vehicle management systems: Show randomly sampled vehicles that were active between 17:00 and 22:00 a week ago. Issue 1: not easy to analyze/visualize easy to analyze/visualize & accelerated efficiency (if only samples are obtained)

Preliminary No. 4 Problem Definition: IRS (Independent Range Sampling) on interval data Input: 𝑋 (a set of 𝑛 intervals), 𝑞 (query interval), and 𝑠 (#samples) Output: 𝑆 (a set of 𝑠 intervals, each of which is picked from 𝑞 ∩ 𝑋 uniformly at random) First attempt: How about (1) running a search algorithm (e.g., [1]) (2) and then do random sampling? query interval 𝑞 (2) Random sampling 𝑥 | 𝑥 ∈ 𝑋, 𝑥 ∩ 𝑞 Claim (no merit) The above algorithm incurs Ω 𝑞 ∩ 𝑋 Database 𝑋 (1) Range search = 𝑂(𝑛) time. [1] “Hint: A hierarchical index for intervals in main memory,” In SIGMOD, 2022. 𝑆

Contributions No. 5 Challenge How to obtain 𝑠 random samples without running a search algorithm? Findings We prove that there exists a • • 1 ෨ 𝑂෨ 𝑠 time and 𝑂(𝑛) space algo. that picks 𝑠 samples, each of which is obtained with probability 𝑋∩𝑞 ෨ 𝑂෨ 𝑠 time and 𝑂(𝑛) space algo. that picks 𝑠 weighted samples, each of which is obtained with probability σ 𝑂෨ ⋅ hides any polylog factors. Experiments • We demonstrate that our algorithms are much faster than baseline algorithms. • Our implementation is available at https://github.com/amgt-d1/IRS-interval. 1 𝑥𝑗 ∈𝑞∩𝑋 𝑤(𝑥𝑗 )

https://github.com/amgt-d1/IRS-interval

Why Existing Techniques are NOT Sufficient? No. 6 𝑥1 Use 1-dimensional IRS algorithm [2]? 0. Put all endpoints into a sorted array 𝑥2 𝑥8 𝑞 𝑥3 𝑥4 𝑥9 𝑥6 𝑥5 𝑥10 𝑥7 𝑥11 1. Run two binary searches to find boundaries 2. Random sampling from a set of points within the boundaries ⊳ This approach cannot guarantee “equal sampling probability”. 𝑥8. 𝑙 𝑥2. 𝑙 𝑥4. 𝑙 𝑥8. 𝑟 𝑥1. 𝑙 𝑥9. 𝑙 𝑥2. 𝑟 𝑥4. 𝑟 2 points: 𝑥2, 𝑥4 whereas 1 point: 𝑥8, 𝑥1, 𝑥9 Use 2-dimensional IRS algorithm? • Interval data can be transformed into 2-dimensional points [3]. • After the transformation, KDS (2-dimensional IRS algorithm) [4] can be used. ⊳ This approach needs 𝑶( 𝒏 + 𝒔) time, still having a room to improve. [2] “Independent range sampling,” In PODS, 2014. [3] “A general technique for top-k geometric intersection query problems,” TKDE, 2014. [4] “Spatial independent range sampling,” In SIGMOD, 2021. We reduce this time complexity. ⋯ 𝑥7. 𝑟

Building Block of Our Technique No. 7 Interval tree: a balanced binary tree for interval data Structure • Each interval is maintained at the firstly stabbed node (e.g., 𝑥1 at 𝑢𝑟𝑜𝑜𝑡 , 𝑥3 at 𝑢2) • Each node 𝑢𝑖 maintains • a value 𝑐𝑖 (centroid) • 𝐿𝑙𝑖 and 𝐿𝑟𝑖 which respectively are lists of intervals sorted 𝐿𝑙1 = 𝑥2 , 𝑥4 𝐿𝑟1 = [𝑥2 , 𝑥4 ] in ascending order of left-endpoint and right-endpoint Range search 𝑢1 Traverse overlapping nodes from 𝑢𝑟𝑜𝑜𝑡 in the following way: 𝑢2 𝑢3 Case 1: If 𝑞. 𝑟 < 𝑐𝑖 , traverse only left child node (e.g., 𝑞1 at 𝑢𝑟𝑜𝑜𝑡 ) 𝑐3 Case 2: If 𝑐𝑖 < 𝑞. 𝑙, traverse only right child node (e.g., 𝑞1 at 𝑢1 ) 𝑥2 Case 3: If 𝑞. 𝑙 ≤ 𝑐𝑖 ≤ 𝑞. 𝑟, traverse both left and right child nodes (e.g., 𝑞2 at 𝑢2 ) ⊳ This approach needs 𝑶(𝒏) time, because of no bounds for case 3. 𝑢𝑟𝑜𝑜𝑡 𝐿𝑙𝑟𝑜𝑜𝑡 = 𝑥1 , 𝑥6 𝐿𝑟𝑟𝑜𝑜𝑡 = [𝑥6 , 𝑥1 ] 𝑥8 𝑢5 𝑢4 𝑐1 𝑥1 𝑐4 𝑐𝑟𝑜𝑜𝑡 𝑥3 𝑥4 𝑥9 𝑥6 𝑥5 𝑥10 𝑞1 𝑐5 𝑢6 𝑐2 𝑥7 𝑥11 𝑞2 𝑐6

Main Idea 1 No. 8 Interval tree: a balanced binary tree for interval data Structure • Each interval is maintained at the firstly stabbed node (e.g., 𝑥1 at 𝑢𝑟𝑜𝑜𝑡 , 𝑥3 at 𝑢2) • Each node 𝑢𝑖 maintains 𝐿𝑙𝑖 = 𝒙𝟏 , 𝒙𝟐 , 𝒙𝟑 , 𝒙𝟒 , 𝑥5 , 𝑥6 • a value 𝑐𝑖 (centroid) • 𝐿𝑙𝑖 and 𝐿𝑟𝑖 which respectively are lists of intervals sorted Exploiting this structure: in ascending order of left-endpoint and right-endpoint A binary search identifies the boundary where 𝑞 overlaps. Range search ⊳ Only 𝑂(log 𝑛) time for each node Traverse overlapping nodes from 𝑢𝑟𝑜𝑜𝑡 in the following way: Case 1: If 𝑞. 𝑟 < 𝑐𝑖 , traverse only left child node (e.g., 𝑞1 at 𝑢𝑟𝑜𝑜𝑡 ) Case 2: If 𝑐𝑖 < 𝑞. 𝑙, traverse only right child node (e.g., 𝑞1 at 𝑢1 ) Case 3: If 𝑞. 𝑙 ≤ 𝑐𝑖 ≤ 𝑞. 𝑟, traverse both left and right child nodes (e.g., 𝑞2 at 𝑢2 ) ⊳ This approach needs 𝑂(𝑛) time, because of no bounds for case 3.

10.

Main Idea 2 No. 9 Interval tree: a balanced binary tree for interval data Make this case at most once: Structure If we can achieve this, we need only Cases 1 & 2: 𝑂 log 𝑛 × 𝑂 log 𝑛 = 𝑂෨ 1 + • Each interval is maintained at the firstly stabbed node (e.g., 𝑥1 at 𝑢𝑟𝑜𝑜𝑡 , 𝑥3 at 𝑢2) • Each node 𝑢𝑖 maintains • a value 𝑐𝑖 (centroid) • 𝐿𝑙𝑖 and 𝐿𝑟𝑖 which respectively are lists of intervals sorted Case 3: 𝑂 log 𝑛 × 𝑂 1 = 𝑂෨ 1 time to identify the sample candidate. 𝐿𝑙1 = 𝑥2 , 𝑥4 𝐿𝑟1 = [𝑥2 , 𝑥4 ] in ascending order of left-endpoint and right-endpoint Range search 𝑢𝑟𝑜𝑜𝑡 𝐿𝑙𝑟𝑜𝑜𝑡 = 𝑥1 , 𝑥6 𝐿𝑟𝑟𝑜𝑜𝑡 = [𝑥6 , 𝑥1 ] 𝑢1 Traverse overlapping nodes from 𝑢𝑟𝑜𝑜𝑡 in the following way: 𝑢2 𝑢3 Case 1: If 𝑞. 𝑟 < 𝑐𝑖 , traverse only left child node (e.g., 𝑞1 at 𝑢𝑟𝑜𝑜𝑡 ) 𝑐3 Case 2: If 𝑐𝑖 < 𝑞. 𝑙, traverse only right child node (e.g., 𝑞1 at 𝑢1 ) 𝑥2 Case 3: If 𝑞. 𝑙 ≤ 𝑐𝑖 ≤ 𝑞. 𝑟, traverse both left and right child nodes (e.g., 𝑞2 at 𝑢2 ) 𝑥8 𝑢5 𝑢4 𝑐1 𝑥1 𝑐4 𝑐𝑟𝑜𝑜𝑡 𝑥3 𝑥4 𝑥9 𝑥6 𝑐5 𝑥5 𝑥10 𝑢6 𝑐2 𝑐6 𝑥7 𝑥11 𝑞 How to access 𝑥9 and 𝑥10 without incurring 𝑂 𝑛 node accesses?

11.

New Data Structure No. 10 AIT: Augmented Interval Tree Difference to interval tree • Each interval is maintained at the firstly stabbed node (e.g., 𝑥1 at 𝑢𝑟𝑜𝑜𝑡 , 𝑥3 at 𝑢2) • Each node 𝑢𝑖 maintains • a value 𝑐𝑖 (centroid) • 𝐿𝑙𝑖 and 𝐿𝑟𝑖 which respectively are lists of intervals sorted in ascending order of left-endpoint and right-endpoint • 𝑨𝑳𝒍𝒊 and 𝑨𝑳𝒓𝒊 which respectively are lists of all intervals existing in the 𝐿𝑙1 = [𝑥2 , 𝑥4 ], 𝐿𝑟1 = [𝑥2 , 𝑥4 ] 𝐴𝐿𝑙1 = [𝑥8 , 𝑥2 , 𝑥4 , 𝑥9 ] 𝐴𝐿𝑟1 = [𝑥8 , 𝑥2 , 𝑥4 , 𝑥9 ] 𝑢1 subtree rooted at 𝑢𝑖 sorted in asc. order of left-endpoint and right-endpoint. At 𝑢2 , 𝑥3 , 𝑥10 , and 𝑥5 are found from 𝐴𝐿𝑙2 . 𝑐3 (At 𝑢𝑟𝑜𝑜𝑡 , 𝑥1 and 𝑥6 are found.) • Theoretical time efficiency: we traverse at most log 𝑛 (= height) nodes. • Practical efficiency: when we meet case 3, its child nodes are the last ones to be traversed (i.e., no need to traverse their descendants). 𝑢2 𝑢3 At 𝑢1 , 𝑥9 is found from 𝐴𝐿𝑟1 . Merits of this augmentation 𝑢𝑟𝑜𝑜𝑡 𝐿𝑙2 = [𝑥3 , 𝑥5 ], 𝐿𝑟2 = [𝑥5 , 𝑥3 ] 𝐴𝐿𝑙2 = [𝑥3 , 𝑥10 , 𝑥5 , 𝑥7 , 𝑥11 ] 𝐴𝐿𝑟2 = [𝑥10 , 𝑥5 , 𝑥3 , 𝑥11, 𝑥7 ] 𝑥2 𝑥8 𝑢5 𝑢4 𝑐1 𝑥1 𝑐4 𝑐𝑟𝑜𝑜𝑡 𝑥3 𝑥4 𝑥9 𝑥6 𝑐5 𝑥5 𝑥10 𝑢6 𝑐2 𝑐6 𝑥7 𝑥11 𝑞 How to access 𝑥9 and 𝑥10 without incurring 𝑂 𝑛 node accesses?

12.

Our IRS Algorithm No. 11 Case 1 Case 2 Cases 1 & 2: Identify the boundaries in 𝐿𝑙𝑖 or 𝐿𝑟𝑖 ⊳ 𝑂(log 𝑛) boundaries in total 𝐿𝑙𝑟𝑜𝑜𝑡 = 𝒙𝟏 , 𝒙𝟔 𝐿𝑟𝑟𝑜𝑜𝑡 = [𝑥6 , 𝑥1 ] because 𝑂(log 𝑛) nodes have this case (AIT’s height = 𝑂 log 𝑛 ) Case 3 Case 3: Identify the boundaries in 𝐴𝐿𝑗𝑙 and 𝐴𝐿𝑟𝑘 ⊳ 𝑂(1) boundaries in total + early termination 𝐿𝑙1 = [𝑥2 , 𝑥4 ], 𝐿𝑟1 = [𝑥2 , 𝑥4 ] 𝐴𝐿𝑙1 = [𝑥8 , 𝑥2 , 𝑥4 , 𝑥9 ] 𝐴𝐿𝑟1 = [𝑥8 , 𝑥2 , 𝑥4 , 𝒙𝟗 ] 𝑢1 because this case accesses at most additional two nodes 𝑢2 𝑢3 ෩ 𝟏 time to identify the candidate ⊳𝑶 𝑐3 Weighted sampling 𝑥2 (Each node has a different size of candidate) by Walker’s alias method: • 𝑂(log 𝑛) time to build an alias because of 𝑂(log 𝑛) boundaries • 𝑂(1) time to pick a random sample ⊳ 𝑶(𝐥𝐨𝐠 𝒏 + 𝒔) time to pick 𝒔 samples 𝑢𝑟𝑜𝑜𝑡 𝐿𝑙2 = [𝑥3 , 𝑥5 ], 𝐿𝑟2 = [𝑥5 , 𝑥3 ] 𝐴𝐿𝑙2 = [𝒙𝟑 , 𝒙𝟏𝟎 , 𝒙𝟓 , 𝑥7 , 𝑥11 ] 𝐴𝐿𝑟2 = [𝑥10 , 𝑥5 , 𝑥3 , 𝑥11, 𝑥7 ] 𝑥8 𝑢5 𝑢4 𝑐1 𝑥1 𝑐4 𝑐𝑟𝑜𝑜𝑡 𝑥3 𝑥4 𝑥9 𝑥6 𝑐5 𝑥5 𝑥10 𝑢6 𝑐2 𝑐6 𝑥7 𝑥11 𝑞 This query accesses only 𝑢𝑟𝑜𝑜𝑡 , 𝑢1, and 𝑢2.

13.

Theoretical Analysis No. 12 Theorem 1 (Space complexity) The space complexity of our algorithm (or an AIT) is 𝑂(𝑛 log 𝑛). Larger space than that of KDS [4] (i.e., 𝑂 𝑛 ). Theorem 2 (Time complexity) Our algorithm picks 𝑠 random samples in 𝑂(log 2 𝑛 + 𝑠) time. Theorem 3 (Correctness) Our algorithm picks a random samples with probability 1/|𝑞 ∩ 𝑋|. [4] “Spatial independent range sampling,” In SIGMOD, 2021. Less time complexity than that of KDS (i.e., 𝑂( 𝑛 + 𝑠) expected). Better success probability than 1 that of KDS (i.e., ). 𝛼 𝑞∩𝑋

14.

How to Reduce Space Complexity? No. 13 Definition 1 (virtual interval) Given a subset 𝑋𝑖 of 𝑋, the virtual interval of 𝑋𝑖 , 𝑣𝑖 , is defined as 𝑣𝑖 . 𝑙 = min 𝑥. 𝑙 and 𝑣𝑖 . 𝑟 = max 𝑥. 𝑟 𝑥∈𝑋𝑖 𝑥∈𝑋𝑖 Corollary 1 Given a set 𝑉 of virtual intervals such that 𝑉 = Θ 𝑛 , the space complexity of the AIT on 𝑉 is 𝑂 𝑛 . log 𝑛 How to partition 𝑿 (or how to alleviate false positives) Challenge: virtual intervals overlapping 𝑞 may have intervals that do not overlap 𝑞. Our approach: pair-sort Ex.) 𝑞 overlaps 𝑣1 : 𝑞 overlaps 𝑥1 and 𝑥2 but not 𝑥3 . ⊳ This corresponds to drawing a z-curve 𝑥1 𝑥3 𝑣1 𝑥2 𝑞 (sort by left-endpoint and break ties by right-endpoint) Our partition: disjoint and uniform partition based on the sort order

15.

How to Handle Weighted Intervals? No. 14 Why AIT is inefficient for the weighted case? Note: we may have 𝑤 𝑥𝑖 ≠ 𝑤(𝑥𝑗 ) The cumulative sum of the weights of the sequence of overlapping intervals is NOT known without accessing these intervals. (If we access these intervals, 𝑂(𝑛) time is incurred.) AWIT: Augmented Weighted Interval Tree Difference to AIT • Each node 𝑢𝑖 further maintains • • • 𝐿𝑙1 = [𝑥2 , 𝑥4 ], 𝐿𝑟1 = [𝑥2 , 𝑥4 ] 𝐴𝐿𝑙1 = [𝑥8 , 𝑥2 , 𝑥4 , 𝑥9 ] 𝐴𝐿𝑟1 = [𝑥8 , 𝑥2 , 𝑥4 , 𝒙𝟗 ] 𝑢1 𝑾𝒍𝒊 and 𝑾𝒓𝒊 which are respectively arrays maintaining the cumulative sums of weights of the intervals in 𝐿𝑙𝑖 and 𝐿𝑟𝑖 . 𝑨𝑾𝒍𝒊 and 𝑨𝑾𝒓𝒊 which are respectively arrays maintaining the cumulative sums of weights of the intervals in 𝐴𝐿𝑙𝑖 and 𝐴𝐿𝑟𝑖 . Only additional 𝑂(log 𝑛) time for each sampling. 𝑢𝑟𝑜𝑜𝑡 𝐿𝑙2 = [𝑥3 , 𝑥5 ], 𝐿𝑟2 = [𝑥5 , 𝑥3 ] 𝐴𝐿𝑙2 = [𝒙𝟑 , 𝒙𝟏𝟎 , 𝒙𝟓 , 𝑥7 , 𝑥11 ] 𝐴𝐿𝑟2 = [𝑥10 , 𝑥5 , 𝑥3 , 𝑥11, 𝑥7 ] 𝑢2 𝑢3 𝑐3 𝑥2 𝑥8 𝑢5 𝑢4 𝑐1 𝑥1 𝑐4 𝑐𝑟𝑜𝑜𝑡 𝑥3 𝑥4 𝑥9 𝑥6 𝑐5 𝑥5 𝑥10 𝑢6 𝑐2 𝑐6 𝑥7 𝑥11 𝑞 𝐴𝑊2𝑙 = 𝑤(𝑥3 , 𝑤 𝑥3 + 𝑤(𝑥10 ), 𝒘 𝒙𝟑 + 𝒘(𝒙𝟏𝟎 ) + 𝒘(𝒙𝟓 ), ⋯ ]

16.

Experiments: Dataset, Pre-processing Time, and Memory Usage Diverse applications: • Book management • Bitcoin • Vehicles (train & taxi) • • AIT needs more offline time and memory than the others, but they are still reasonable. AIT-V (AIT on 𝑉) alleviates both costs. No. 15

17.

Experiments: Query Processing Time (Decomposed Time) AIT & AIT-V limits the space where 𝑞 ∩ 𝑋 exist so quickly. • • AIT picks 1000 samples much faster than KDS. AIT-V is slower than AIT because it incurs checks of "𝑥 ∈ 𝑞 ∩ 𝑋". No. 16

18.

Experiments: Query Processing Time (Impact of Parameters) AIT & AIT-V are not affected by range size. AIT & AIT-V are less affected by dataset size (𝑛). Times of range sampling algorithms are linear to 𝑠. No. 17

19.

Experiments: Query Processing Time (Weighted Interval Case) AIT is less affected by range size. AWIT always outperforms the others. AWIT has a superior sampling efficiency. No. 18

20.

Summary No. 19 Problem Independent range sampling on interval data that returns 𝑠 intervals, each of which is picked from 𝑞 ∩ 𝑋 uniformly at random Findings We prove that there exists an • • 1 ෨ 𝑂෨ 𝑠 time and 𝑂(𝑛) space algo. that picks 𝑠 samples, each of which is obtained with probability 𝑋∩𝑞 ෨ 𝑂෨ 𝑠 time and 𝑂(𝑛) space algo. that picks 𝑠 weighted samples, each of which is obtained with probability σ Experiments • We demonstrate that our algorithms are much faster than a baseline algorithm. • Our implementation is available at https://github.com/amgt-d1/IRS-interval. 1 𝑥𝑗 ∈𝑞∩𝑋 𝑤(𝑥𝑗 )

https://github.com/amgt-d1/IRS-interval