Improving Data Quality via Pre-Task Participant Screening in Crowdsourced GUI Experiments

April 20, 26

#crowdsourcing #GUI-experiments #data-quality #participant-screening #model-fit

Nakamura Laboratory (Meiji University)

@nkmr-lab

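Satoshi Nakamura Laboratory, Department of Frontier Media Science, School of Interdisciplinary Mathematical Sciences, Meiji University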

Text of each slide

1.

Improving Data Quality via Pre-Task Participant Screening in Crowdsourced GUI Experiments

Takaya Miyama, Satoshi Nakamura (Meiji University), Shota Yamanaka (LY Corporation)

[Figure: Pre-task → Main task. Screening improves model fit (R²).]

2.

Background: Crowdsourced GUI experiments

Advantages
• Fast recruitment: > 1,000 participants in a few hours.
• Large samples: help evaluate performance models and rare events (e.g., pointing errors) [1].

Disadvantages
• Low observability: inattentive/nonconforming behavior can reduce data quality [2].
• Results may differ from the lab: faster but less accurate performance [3] → distorts model evaluation.

Need a way to screen out inattentive/nonconforming participants for reliable model evaluation.

[1] Yamanaka, HCOMP 2021; [2] Brühlmann+, Methods in Psychology, 2020; [3] Findlater+, CHI 2017

3.

What does “inattentive/nonconforming” look like?

• Conforming: careful, accurate
• Partially conforming: faster, less accurate
• Highly nonconforming: minimal effort, random actions

7.

What does “inattentive/nonconforming” look like? (summary)

Conforming, partially conforming, and highly nonconforming participants perform the same task but produce very different data quality.
→ Need screening before the main task.

8.

Approach: Pre-task screening before the main task

• Run a pre-task first; only passing participants proceed to the main task.
• Screen out inattentive/nonconforming participants (not selecting top performers).

[Diagram: Pre-task → Main task. All participants start at the pre-task. Non-passing participants are screened out. Only passing participants proceed.]

9.

Pre-task: Size-adjustment

Resize the on-screen card image to match a physical card [4].

• Brief: < 10 seconds on average.
• Task-relevant: accurate operation is relevant to GUI tasks (e.g., pointing).
• Screening rule: use the size-adjustment error between the on-screen card and the physical card. → Passing if below threshold, non-passing otherwise.

[Figure: size-adjustment error]

[4] Li+, Scientific Reports, 2020
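The screening rule above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `is_passing` is hypothetical, the error values are made up, and treating an error exactly equal to the threshold as passing is an assumption.

```python
def is_passing(error_mm: float, threshold_mm: float) -> bool:
    """Return True if the participant passes the pre-task screening.

    error_mm: absolute size-adjustment error in millimetres.
    threshold_mm: screening threshold T in millimetres.
    Assumption: an error exactly at the threshold counts as passing.
    """
    if error_mm < 0:
        raise ValueError("error_mm must be an absolute (non-negative) error")
    return error_mm <= threshold_mm


# Hypothetical errors (mm) for three participants, screened at T = 2 mm.
errors_mm = [0.4, 1.9, 12.5]
passing_flags = [is_passing(e, threshold_mm=2.0) for e in errors_mm]
```

Only participants whose flag is true would proceed to the main task; the rule uses no main-task data.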

10.

Evaluation overview

1. Crowdsourced experiment (data collection): size-adjustment (pre-task) → pointing (main task).
2. Simulation: test whether the screening improves model fit.

[Diagram: Pre-task → Main task]

11.

Crowdsourced experiment: Pre-task (size-adjustment)

Resize the on-screen card image to match a physical card.

• Reference card: ISO/IEC 7810 ID-1 (e.g., credit, ID, transit cards); match the short side (53.98 mm).
• Device: iPhone-only (7+); infer device PPI, convert px → mm.
• Measure: absolute size-adjustment error (mm).

[Figure: size-adjustment error]
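The px → mm conversion and the error measure can be illustrated as follows. This is a sketch under stated assumptions, not the study's code: it assumes the adjusted length is given in physical device pixels (lengths in CSS pixels would first need the device's scale factor), and the 326 PPI screen and 700 px length are hypothetical example values. The function names are also hypothetical.

```python
MM_PER_INCH = 25.4
CARD_SHORT_SIDE_MM = 53.98  # ISO/IEC 7810 ID-1 short side, as stated on the slide


def px_to_mm(length_px: float, ppi: float) -> float:
    """Convert a length in physical pixels to millimetres using the device PPI."""
    if ppi <= 0:
        raise ValueError("ppi must be positive")
    return length_px / ppi * MM_PER_INCH


def size_adjustment_error_mm(adjusted_px: float, ppi: float) -> float:
    """Absolute difference between the adjusted on-screen short side and the real card."""
    return abs(px_to_mm(adjusted_px, ppi) - CARD_SHORT_SIDE_MM)


# Hypothetical example: a 326 PPI screen and an adjusted short side of 700 px.
example_error = size_adjustment_error_mm(adjusted_px=700, ppi=326)
```

With these example values the on-screen short side is about 54.5 mm, so the error is roughly 0.56 mm.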

12.

Crowdsourced experiment: Main task (pointing)

Tap the two targets alternately.

• Design: W (mm), 9 levels (2.0, 2.8, 3.6, 4.4, 5.2, 6.0, 6.8, 7.6, 8.4).
• Trials: 360 per participant.
• Measures: movement time (MT) and error rate (ER) → model fit (R²).

[Figure: two targets of width W; dimension label 30 mm]

13.

Crowdsourced experiment: Data collection

• Platform: Yahoo! Crowdsourcing (no pre-screening).
• Participants: N = 519 analyzed.
• Time: 5 min 27 s on average.

[Diagram: Pre-task → Main task]

14.

15.

Crowdsourced experiment: Pre-task outcome

• 310 (60%) had ≤ 2 mm error (likely passing, conforming).
• 143 (28%) had ≥ 10 mm error (likely non-passing, highly inattentive/nonconforming).

The pre-task outcome is continuous (no single cutoff) → evaluate screening under different threshold values.
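Because there is no single cutoff, one way to explore thresholds is to count how many participants pass at each candidate value. The sketch below is illustrative only: the `errors_mm` list is hypothetical (not the study's data), `passing_counts` is a hypothetical helper, and the inclusive comparison is an assumption. The last lines only re-derive the percentages reported on the slide from the stated counts.

```python
def passing_counts(errors_mm, thresholds_mm):
    """Map each threshold T to the number of participants with error <= T."""
    return {t: sum(1 for e in errors_mm if e <= t) for t in thresholds_mm}


# Hypothetical pre-task errors (mm); NOT the study's data.
errors_mm = [0.3, 0.8, 1.5, 1.9, 2.4, 4.0, 7.5, 10.2, 15.0, 22.1]
counts = passing_counts(errors_mm, thresholds_mm=range(1, 11))

# Percentages reported on the slide, recomputed from the stated counts.
N_ANALYZED = 519
share_le_2mm = round(310 / N_ANALYZED * 100)
share_ge_10mm = round(143 / N_ANALYZED * 100)
```

A stricter (smaller) threshold can only keep the same number of participants or fewer, which is the sample-size side of the trade-off.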

16.

Simulation: Does screening improve model fit?

If the pre-task can screen out participants likely to be nonconforming, mixing more non-passing participants should reduce model fit in the main task.

Parameters:
• N: simulated sample size (N = 80)
• T (mm): threshold on the pre-task outcome for defining passing / non-passing (T = 1–10, step 1)
• X (%): ratio of non-passing participants mixed into the sample (X = 0–100%, step 10)

Models:
• $MT = a + b \cdot \log_2\left(\frac{A}{W} + 1\right)$ [5]
• $ER = 1 - \operatorname{erf}\left(\frac{W}{2\sqrt{2}\,\sigma_y}\right)$ [6]

[5] Fitts, Journal of Experimental Psychology, 1954; [6] Yamanaka+, ISS 2020
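The two models and the R² measure can be written out in plain Python as a rough sketch. Several things here are assumptions rather than details from the slides: A = 30 mm is taken from the dimension label in the main-task figure, the MT values are synthetic (generated from made-up coefficients a = 100, b = 150), an ordinary least-squares line fit is used for MT, and all function names are hypothetical.

```python
import math


def index_of_difficulty(a_mm: float, w_mm: float) -> float:
    """log2(A / W + 1), the term multiplied by b in the MT model."""
    return math.log2(a_mm / w_mm + 1)


def predict_er(w_mm: float, sigma_mm: float) -> float:
    """ER = 1 - erf(W / (2 * sqrt(2) * sigma_y))."""
    return 1 - math.erf(w_mm / (2 * math.sqrt(2) * sigma_mm))


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot


# W levels from the main-task slide; A = 30 mm is an assumption.
W_LEVELS = [2.0, 2.8, 3.6, 4.4, 5.2, 6.0, 6.8, 7.6, 8.4]
ids = [index_of_difficulty(30.0, w) for w in W_LEVELS]

# Synthetic, noise-free MT values (ms) from made-up coefficients.
mt_observed = [100 + 150 * i for i in ids]
a_hat, b_hat = fit_line(ids, mt_observed)
mt_r2 = r_squared(mt_observed, [a_hat + b_hat * i for i in ids])
```

With noise-free synthetic data the fit recovers the coefficients and R² is 1; real data would give lower values, which is what the simulation compares across conditions.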

17.

Simulation results: How to read the R² heatmaps

• Each cell shows R² for a (T, X) pair.
• Right: X↑ (more non-passing mixed) / Down: T↑ (less strict screening).

[Heatmap legend: columns = non-passing ratio X (%), from 0% through 50% to 100%; rows = threshold T (mm); color scale from high to low R².]
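How the samples behind such a (T, X) grid might be assembled can be sketched as below. This is only an illustration under assumptions: the slides do not say whether sampling is with or without replacement or how often it is repeated, so this version draws once per cell without replacement; the participant pool is hypothetical; and the function names are invented. Each sample would then be fitted with the MT and ER models to obtain the cell's R².

```python
import random


def draw_mixed_sample(passing_ids, non_passing_ids, n, x_percent, rng):
    """Draw n participants, x_percent of them from the non-passing pool.

    Assumption: sampling without replacement; random.sample raises
    ValueError if a pool is smaller than the number requested.
    """
    n_non = round(n * x_percent / 100)
    n_pass = n - n_non
    return rng.sample(passing_ids, n_pass) + rng.sample(non_passing_ids, n_non)


def simulation_grid(errors_by_id, n=80, thresholds=range(1, 11),
                    ratios=range(0, 101, 10), seed=0):
    """Return {(T, X): list of sampled participant ids} for every grid cell."""
    rng = random.Random(seed)
    grid = {}
    for t in thresholds:
        # Split the pool by the pre-task threshold T (inclusive boundary assumed).
        passing = [pid for pid, e in errors_by_id.items() if e <= t]
        non_passing = [pid for pid, e in errors_by_id.items() if e > t]
        for x in ratios:
            grid[(t, x)] = draw_mixed_sample(passing, non_passing, n, x, rng)
    return grid


# Hypothetical pool: 2000 participants with errors 0.5, 1.5, ..., 19.5 mm.
errors_by_id = {i: (i % 20) + 0.5 for i in range(2000)}
grid = simulation_grid(errors_by_id)
```

The grid has 10 thresholds × 11 ratios = 110 cells, matching the parameter ranges listed for T and X.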

18.

19.

Simulation results: ER model fit (R²)

• Clear degradation: R² drops as X↑ (more non-passing mixed) and T↑ (less strict screening).
• Best fit: the top-left cell (T = 1 mm, X = 0%).

[Heatmap: columns = non-passing ratio X (%), 0% to 100%; rows = threshold T (mm). Annotated values: R² = 0.989 (high end of the color scale) and R² = 0.853 (low end).]

20.

21.

Simulation results: MT model fit (R²)

• Limited degradation: R² drops as X↑ and T↑ too, but the change is smaller than for ER.
→ Accuracy of the pre-task operation is more clearly reflected in tap failure (ER) than in speed (MT).

[Heatmap: columns = non-passing ratio X (%), 0% to 100%; rows = threshold T (mm); color scale from high to low R².]

22.

Conclusion & future work

Main points:
• A brief pre-task (< 10 s size adjustment) enables screening using only pre-task outcomes.
• Strict screening and less nonconforming data improve data quality (model fit, R²).

Limitations:
• May miss participants who are conforming in the pre-task but nonconforming in the main task.
• Choosing the threshold is a trade-off (stricter → fewer nonconforming data, smaller N).

Next:
• Compare with traditional methods (e.g., gold tasks, attention checks).
• Test whether the screening can be applied to other GUI tasks (e.g., dragging, steering, crossing).

I would appreciate it if you could ask questions slowly and in simple English.