April 20, 26
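Slide overview
Satoshi Nakamura Laboratory, Department of Frontier Media Science, School of Interdisciplinary Mathematical Sciences, Meiji University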
Improving Data Quality via Pre-Task Participant Screening in Crowdsourced GUI Experiments
Takaya Miyama, Satoshi Nakamura (Meiji University), Shota Yamanaka (LY Corporation)
[Figure: Pre-task → Main task. Screening improves model fit (R²).]
Background: Crowdsourced GUI experiments
Advantages
• Fast recruitment: > 1,000 participants in a few hours.
• Large samples: help evaluate performance models and rare events (e.g., pointing errors) [1].
Disadvantages
• Low observability: inattentive/nonconforming behavior can reduce data quality [2].
• Results may differ from lab: faster but less accurate performance [3] → can distort model evaluation.
Need a way to screen out inattentive/nonconforming participants for reliable model evaluation.
[1] Yamanaka, HCOMP 2021
[2] Brühlmann+, Methods in Psychology, 2020
[3] Findlater+, CHI 2017
What does “inattentive/nonconforming” look like?
• Conforming: careful, accurate
• Partially conforming: faster, less accurate
• Highly nonconforming: minimal effort, random actions
The same task, very different data quality. → Need screening before the main task.
Approach: Pre-task screening before the main task
• Run a pre-task first; only passing participants proceed to the main task.
• Screen out inattentive/nonconforming participants (not selecting top performers).
[Figure: All participants start at the pre-task. Non-passing participants are screened out. Only passing participants proceed to the main task.]
Pre-task: Size-adjustment
Resize the on-screen card image to match a physical card [4].
• Brief: < 10 seconds on average.
• Task-relevant: accurate operation is relevant to GUI tasks (e.g., pointing).
• Screening rule: use the size-adjustment error between the on-screen card and the physical card → passing if below threshold, non-passing otherwise.
[Figure: size-adjustment error]
[4] Li+, Scientific Reports, 2020
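The screening rule above can be sketched as a simple threshold check. This is a minimal illustration, not the authors' implementation: the function names, the participant-ID dictionary, and the strict `<` comparison at the boundary are assumptions.

```python
def passes_screening(error_mm: float, threshold_mm: float) -> bool:
    """Pass when the absolute size-adjustment error is below the threshold.

    Assumption: a strict '<' comparison; the slides do not state how an
    error exactly equal to the threshold is handled.
    """
    return abs(error_mm) < threshold_mm


def split_participants(errors_mm: dict, threshold_mm: float):
    """Split hypothetical participant IDs into passing and non-passing lists."""
    passing, non_passing = [], []
    for pid, err in errors_mm.items():
        if passes_screening(err, threshold_mm):
            passing.append(pid)
        else:
            non_passing.append(pid)
    return passing, non_passing
```

Only the pre-task error is consulted, so the split can be made before any main-task data exist.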
Evaluation overview
1. Crowdsourced experiment (data collection): size-adjustment (pre-task) → pointing (main task).
2. Simulation: test whether the screening improves model fit.
[Figure: Pre-task → Main task]
Crowdsourced experiment: Pre-task (size-adjustment)
Resize the on-screen card image to match a physical card.
• Reference card: ISO/IEC 7810 ID-1 (e.g., credit, ID, transit cards); match the short side (53.98 mm).
• Device: iPhone-only (7+); infer device PPI, convert px → mm.
• Measure: absolute size-adjustment error (mm).
[Figure: size-adjustment error]
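The px → mm conversion and the error measure can be sketched as follows. This is a hedged illustration: it assumes the adjusted length is given in physical pixels and that the device PPI is already known; the function names are hypothetical, and the slides do not describe how PPI is inferred.

```python
MM_PER_INCH = 25.4
CARD_SHORT_SIDE_MM = 53.98  # ISO/IEC 7810 ID-1 short side, as stated on the slide


def px_to_mm(length_px: float, ppi: float) -> float:
    """Convert a length in physical pixels to millimetres for a given PPI."""
    return length_px / ppi * MM_PER_INCH


def size_adjustment_error_mm(adjusted_px: float, ppi: float) -> float:
    """Absolute difference between the adjusted card side and the reference."""
    return abs(px_to_mm(adjusted_px, ppi) - CARD_SHORT_SIDE_MM)
```

For example, with an illustrative PPI of 326, a card side adjusted to `53.98 / 25.4 * 326` pixels yields an error of approximately zero.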
Crowdsourced experiment: Main task (pointing)
Tap the two targets alternately.
• Design: W (mm): 9 levels (2.0, 2.8, 3.6, 4.4, 5.2, 6.0, 6.8, 7.6, 8.4).
• Trials: 360 per participant.
• Measures: movement time (MT) and error rate (ER) → model fit (R²).
[Figure: two targets, labeled 30 mm and W]
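The two measures can be aggregated per target width along these lines. This is a sketch under stated assumptions: the trial tuple layout `(w_mm, mt_ms, is_error)` is hypothetical, and MT is averaged over all trials because the slides do not say whether error trials are excluded.

```python
from collections import defaultdict


def summarize_by_width(trials):
    """Return {W: (mean MT, error rate)} from (w_mm, mt_ms, is_error) tuples.

    Assumption: MT is averaged over all trials, including error trials.
    """
    grouped = defaultdict(list)
    for w_mm, mt_ms, is_error in trials:
        grouped[w_mm].append((mt_ms, is_error))

    summary = {}
    for w_mm, rows in sorted(grouped.items()):
        mean_mt = sum(mt for mt, _ in rows) / len(rows)
        error_rate = sum(1 for _, err in rows if err) / len(rows)
        summary[w_mm] = (mean_mt, error_rate)
    return summary
```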
Crowdsourced experiment: Data collection
• Platform: Yahoo! Crowdsourcing (no pre-screening).
• Participants: N = 519 analyzed.
• Time: 5 min 27 s on average.
[Figure: Pre-task → Main task]
Crowdsourced experiment: Pre-task outcome
• 310 (60%) had ≤ 2 mm error (likely passing, conforming).
• 143 (28%) had ≥ 10 mm error (likely non-passing, highly inattentive/nonconforming).
The pre-task outcome is continuous (no single cutoff) → evaluate screening under different threshold values.
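Because there is no single cutoff, one way to inspect the outcome is to count how many participants would pass at each candidate threshold. The sketch below is illustrative only: the function name and the strict `<` comparison are assumptions, and the default thresholds of 1 to 10 mm simply mirror the range used later in the simulation.

```python
def passing_counts(errors_mm, thresholds_mm=range(1, 11)):
    """Count participants whose absolute error is below each threshold T (mm)."""
    return {t: sum(1 for e in errors_mm if abs(e) < t) for t in thresholds_mm}
```

A stricter (smaller) threshold can only keep the same number of participants or fewer, which makes the sample-size cost of strict screening visible.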
Simulation: Does screening improve model fit?
If the pre-task can screen out participants likely to be nonconforming, mixing more non-passing participants should reduce model fit in the main task.
Parameters:
• N: simulated sample size (N = 80).
• T (mm): threshold on the pre-task outcome for defining passing / non-passing (T = 1–10, step 1).
• X (%): ratio of non-passing participants mixed into the sample (X = 0–100%, step 10).
Models:
• $MT = a + b \cdot \log_2\left(\frac{A}{W} + 1\right)$ [5]
• $ER = 1 - \operatorname{erf}\left(\frac{W}{2\sqrt{2}\,\sigma_y}\right)$ [6]
[5] Fitts, Journal of Experimental Psychology, 1954
[6] Yamanaka+, ISS 2020
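The two models and the mixing step can be sketched as below. This is a minimal sketch, not the authors' simulation code: the function names are hypothetical, and sampling without replacement is an assumption because the slides do not describe the resampling procedure.

```python
import math
import random


def mt_model(a: float, b: float, amplitude: float, width: float) -> float:
    """MT model from the slide: MT = a + b * log2(A / W + 1)."""
    return a + b * math.log2(amplitude / width + 1)


def er_model(width: float, sigma_y: float) -> float:
    """ER model from the slide: ER = 1 - erf(W / (2 * sqrt(2) * sigma_y))."""
    return 1 - math.erf(width / (2 * math.sqrt(2) * sigma_y))


def sample_mixed(passing, non_passing, n=80, x_percent=0, rng=None):
    """Draw n participants, x_percent of them from the non-passing pool.

    Assumption: sampling without replacement; random.sample raises
    ValueError if a pool is smaller than the number requested from it.
    """
    rng = rng or random.Random()
    n_non_passing = n * x_percent // 100
    n_passing = n - n_non_passing
    return rng.sample(passing, n_passing) + rng.sample(non_passing, n_non_passing)
```

With N = 80 and X in steps of 10, the non-passing share is always a whole number of participants (for example, 24 of 80 at X = 30%).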
Simulation results: How to read the R² heatmaps
• Each cell shows R² for a (T, X) pair.
• Right: X↑ (more non-passing mixed) / Down: T↑ (less strict screening).
[Figure: heatmap layout. Rows: threshold T (mm); columns: non-passing ratio X (%), 0% to 100%. Color scale from high to low R².]
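The heatmap layout can be enumerated as a grid of (T, X) pairs. This is only an illustration of the reading direction described above; the function name is hypothetical and the default ranges are taken from the simulation parameters (T = 1–10 mm, X = 0–100% in steps of 10).

```python
def heatmap_cells(thresholds_mm=range(1, 11), ratios_percent=range(0, 101, 10)):
    """Return rows of (T, X) pairs: T increases downward, X increases rightward."""
    return [[(t, x) for x in ratios_percent] for t in thresholds_mm]
```

With these defaults the grid has 10 rows and 11 columns, and the top-left cell corresponds to the strictest threshold with no non-passing participants mixed in.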
Simulation results: ER model fit (R²)
• Clear degradation: R² drops as X↑ (more non-passing mixed) and T↑ (less strict screening).
• Best fit: the top-left cell (T = 1 mm, X = 0%).
[Figure: R² heatmap for the ER model. Rows: threshold T (mm); columns: non-passing ratio X (%), 0% to 100%. Color scale from high (R² = 0.989) to low (R² = 0.853).]
Keeping T strict and X small improves model fit, reducing the risk of misleading model evaluation due to nonconforming data.
Simulation results: MT model fit (R²)
• Limited degradation: R² drops as X↑ and T↑ too, but the change is smaller than for ER.
→ Accuracy of the pre-task operation is more clearly reflected in tap failure (ER) than in speed (MT).
[Figure: R² heatmap for the MT model. Rows: threshold T (mm); columns: non-passing ratio X (%), 0% to 100%. Color scale from high to low R².]
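Model fit for the MT model can be computed with an ordinary least-squares line over the index of difficulty, followed by R². The sketch below is an assumption about the procedure, not the authors' code: the slides do not state how the fit was performed, and all function names are hypothetical.

```python
import math


def index_of_difficulty(amplitude: float, width: float) -> float:
    """ID term of the MT model: log2(A / W + 1)."""
    return math.log2(amplitude / width + 1)


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot
```

Here `xs` would hold the ID value for each W level and `ys` the corresponding mean MT; the same `r_squared` helper applies to ER predictions once `sigma_y` has been estimated.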
Conclusion & future work
Main points:
• A brief pre-task (< 10 s size adjustment) enables screening using only pre-task outcomes.
• Strict screening and less nonconforming data improve data quality (model fit, R²).
Limitations:
• May miss participants who are conforming in the pre-task but nonconforming in the main task.
• Choosing a threshold is a trade-off (stricter → fewer nonconforming data, smaller N).
Next:
• Compare with traditional methods (e.g., gold tasks, attention checks).
• Test whether the screening can be applied to other GUI tasks (e.g., dragging, steering, crossing).
I would appreciate it if you could ask questions slowly and in simple English.