December 19, 23
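Slide summary
This deck introduces the outlier detection packages registered on CRAN, as presented on December 16, 2023 at the Institute of Statistical Mathematics cooperative research meeting "Development and Use of the Data Analysis Environment R" (the "R Research Meeting").
I am an R user.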
R Research Meeting 2023: RMSD and RMSDp (December 16)
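Improving the quality of individual data: data checking; practical implementation of the MSD method; parallelization of the MSD method and comparison with other methods (elliptical distribution model)
Improving the quality of aggregates: weight adjustment; adjusting weights using robust regression weights (regression model)
Improving the quality of imputed values: regression imputation; evaluation of IRLS (regression model)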
Reference: UN/ECE Data Editing Group
Barnett & Lewis (1994). Outliers in Statistical Data, 3rd ed., Wiley, New York.
https://www.stat.go.jp/training/2kenkyu/2-2-723.html
EUREDIT project (EU), covering imputation. Period: 2001.03.01 to 2003.02.28. Website: https://www.cs.york.ac.uk/euredit/euredit-main.html
Stahel-Donoho, MSD
Franklin & Brodeur (1997). A practical application of a robust multivariate outlier detection method. Proceedings of the Survey Research Methods Section, American Statistical Association, pp. 186-191.
Euredit: Béguin & Hulliger (2003). Robust Multivariate Outlier Detection and Imputation With Incomplete Survey Data. EUREDIT Deliverable D4/5.2.1/2 Part C.
Wada (2010): R, Euredit
SD (Maronna & Yohai, 1995)
1: PCA (Friedman & Tukey, 1974) [1]
2: Patak (1990)
3:
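The Stahel-Donoho idea scores each observation by its largest robustly standardized distance from the median over many one-dimensional projections. The following is a minimal base R sketch of that outlyingness score, assuming plain random directions; the function name `sd_outlyingness` and its arguments are hypothetical, and the MSD method itself works with orthogonal bases, which this sketch does not reproduce.

```r
# Sketch of Stahel-Donoho outlyingness with random directions.
# Assumption: random directions only; not the RMSD implementation.
sd_outlyingness <- function(x, n_dir = 500, seed = 1) {
  x <- as.matrix(x)
  set.seed(seed)
  p <- ncol(x)
  # draw random directions and scale each column to unit length
  a <- matrix(rnorm(p * n_dir), nrow = p)
  a <- sweep(a, 2, sqrt(colSums(a^2)), "/")
  proj <- x %*% a                        # n x n_dir projected values
  med  <- apply(proj, 2, median)         # robust location per direction
  madv <- apply(proj, 2, mad)            # robust scale per direction
  # absolute distance from the median, divided by the MAD
  z <- abs(sweep(proj, 2, med, "-"))
  z <- sweep(z, 2, madv, "/")
  apply(z, 1, max)                       # worst case over all directions
}

set.seed(42)
# 100 standard normal points plus one planted outlier in row 101
x <- rbind(matrix(rnorm(200), ncol = 2), c(8, -8))
r <- sd_outlyingness(x)
which.max(r)   # index of the most outlying observation
```

Because the score is a maximum over directions, more directions can only raise it, which is why the number of bases matters for both accuracy and cost.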
MSD
Number of bases: 10; EUREDIT: 41 (p = 2), 93 (p = 3)
EUREDIT: “Huber-like”
Ⅱ MSD
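[Figure: Hertzsprung-Russell diagram data, data(starsCYG, package="robustbase"), plotted as V1 (horizontal) against V2 (vertical). Contours at the 95%, 99%, and 99.9% points are shown for the Mahalanobis distance, the Canadian version, the Canadian version with more bases, the EUREDIT version, and the EUREDIT version with fewer bases.]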
Canadian and Euredit versions compared [Wada (2010)]; panels labelled A and B
A. Based on Maronna and Yohai (1995); the formula was set by Béguin and Hulliger (2003).

 p   Bases (Canadian)   Bases (EUREDIT)   p × bases (Canadian)   p × bases (EUREDIT)
 2         10                  41                  20                      82
 3         10                  93                  30                     279
 4         10                 208                  40                     832
 5         10                 466                  50                   2,330
 6         10               1,039                  60                   6,234
 7         10               2,319                  70                  16,233
 8         10               5,172                  80                  41,376
 9         10              11,539                  90                 103,851
10         10              25,739                 100                 257,390
Variables (p)   Bases (N)    Elements (p × p × N)   Size at 8 bytes per element (KB)
 2                     21                      84                    1
 3                     31                     279                    2
 4                     52                     832                    7
 5                     93                   2,325                   19
 8                  5,172                 331,008                2,648
10                 25,740               2,574,000               20,592
15                 94,774              21,324,150              170,593
20 ※            3,925,749           1,570,299,600           12,562,397

※ 32bit
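The arithmetic behind this table can be sketched in a few lines of R. This assumes that each basis is stored as a p × p matrix of doubles (8 bytes each) and that 1 KB is taken as 1000 bytes with rounding; under those assumptions the calculation reproduces, for example, the 12,562,397 KB (about 12.5 GB) shown for p = 20.

```r
# Memory needed to hold N bases, each a p x p matrix of doubles.
# Assumptions: 8 bytes per element, 1 KB = 1000 bytes, rounded.
p <- c(2, 3, 4, 5, 8, 10, 15, 20)                          # number of variables
N <- c(21, 31, 52, 93, 5172, 25740, 94774, 3925749)        # number of bases
elements <- p^2 * N                                        # stored values
size_kb  <- round(elements * 8 / 1000)                     # size in KB
data.frame(p, N, elements, size_kb)
```

The growth is driven almost entirely by N, so the memory limit is reached long before p itself looks large.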
Euredit version
PROS: EUREDIT, MSD
CONS: EUREDIT; 11; 32bit; 1; 100; apply
foreach and doParallel
• for
• Windows OS, Linux [Mac]
CRAN vignette: “Getting Started with doParallel and foreach”
https://cran.r-project.org/web/packages/doParallel/vignettes/gettingstartedParallel.pdf
https://www.slideshare.net/k.wada/r-d27d
• R: for
• MSD: p
• Core 1: for() / apply(); Core 2; Core 3; Core 4
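Splitting a loop over projection directions across cores can be sketched with foreach and doParallel as follows. This is an illustration under stated assumptions, not the RMSDp implementation: the helper `outly`, the toy data, the two-worker cluster, and the four blocks of ten directions are all hypothetical choices.

```r
library(foreach)
library(doParallel)

# hypothetical helper: largest |projection - median| / MAD over the
# directions in d, computed for every row of x
outly <- function(x, d) {
  proj <- x %*% d
  z <- abs(sweep(proj, 2, apply(proj, 2, median), "-"))
  z <- sweep(z, 2, apply(proj, 2, mad), "/")
  apply(z, 1, max)
}

set.seed(1)
x    <- matrix(rnorm(300), ncol = 3)          # toy data, 100 x 3
dirs <- matrix(rnorm(3 * 40), nrow = 3)       # 40 projection directions
chunks <- split(seq_len(ncol(dirs)), rep(1:4, each = 10))  # 4 blocks

cl <- parallel::makeCluster(2)                # assumption: 2 workers
registerDoParallel(cl)
# each task scores one block of directions; pmax merges the blocks
r_par <- foreach(idx = chunks, .combine = pmax) %dopar% {
  outly(x, dirs[, idx, drop = FALSE])
}
parallel::stopCluster(cl)

# sequential reference over all directions at once
r_seq <- outly(x, dirs)
all.equal(r_par, r_seq)
```

Combining with `pmax` works because the score is a maximum over directions, so the blocks can be processed independently and merged in any order.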
100 12.5GB 20 [Wada & Tsubaki (2013)]
Packages on CRAN
• RMSD
• RMSDp
100; Euredit; 11; RMSD::RMSD(): 32bit, 64bit; RMSDp::RMSDp(); RMSD::RMSD(); PC
RMSD
Arguments: inp, nb, sd, pt
Return values: u, V, Wt, mah, FF (F), cf, ot
pt: 99.9%; ot: 1: ok, 2: outlier
RMSD
install.packages("RMSD")
library(RMSD)
data(starsCYG, package = "robustbase")
o1.msd <- RMSD(starsCYG)
plot(starsCYG, pch = 21, col = "gray", bg = o1.msd$ot, cex = 1.5)

# eigendecomposition of the matrix V returned by RMSD()
n <- nrow(starsCYG)
d <- ncol(starsCYG)
eg <- eigen(o1.msd$V)
P <- eg$vectors
D <- matrix(rep(0, d * d), ncol = d)
diag(D) <- eg$values
PP <- solve(P)

# rescale o1.msd$cf by (n^2 - 1) d / ((n - d) n)
cf <- o1.msd$cf * (n^2 - 1) * d / ((n - d) * n)
# semi-axis lengths of the ellipse
ax1 <- sqrt(cf * eg$values)

# 201 points on the axis-aligned ellipse
nb <- 0:200
dw <- 2 * pi / 200; w <- dw * nb
XA1 <- ax1[1] * cos(w); XA2 <- ax1[2] * sin(w)
# rotate by the eigenvectors and shift to the centre u
XX1 <- t(t(PP) %*% t(cbind(XA1, XA2)) + o1.msd$u)

lines(XX1, col = 2, lwd = 3, lty = 2)                         # ellipse
points(o1.msd$u[1], o1.msd$u[2], pch = 19, col = 3, cex = 2)  # centre
[Figure: scatter plot of starsCYG, log.Te (horizontal) against log.light (vertical), with the 99.9% ellipse.]
RMSDp
Arguments: inp, cores, nb, sd, pt, dv (10000)
Return values: u, V, wt, mah, cf, ot
pt: 99.9%; ot: 1: ok, 2: outlier
RMSDp
# install.packages("RMSDp")
library(RMSDp)
library(MASS)   # provides parcoord()
tmp <- tempfile()
download.file("https://archive.ics.uci.edu/static/public/109/wine.zip", tmp, mode = "wb")
wine <- read.csv(unz(tmp, "wine.data"), header = FALSE)
unlink(tmp)
dim(wine) # [1] 178 14
# run RMSDp 10 times on the data with column 1 dropped
ot1 <- RMSDp(wine[,-1]); ot2 <- RMSDp(wine[,-1]); ot3 <- RMSDp(wine[,-1]); ot4 <- RMSDp(wine[,-1])
ot5 <- RMSDp(wine[,-1]); ot6 <- RMSDp(wine[,-1]); ot7 <- RMSDp(wine[,-1]); ot8 <- RMSDp(wine[,-1])
ot9 <- RMSDp(wine[,-1]); ot10 <- RMSDp(wine[,-1])
# combine the outlier flags from the 10 runs
ot <- cbind(ot1$ot, ot2$ot, ot3$ot, ot4$ot, ot5$ot, ot6$ot, ot7$ot, ot8$ot, ot9$ot, ot10$ot)
s1 <- apply(ot - 1, 1, sum)   # number of runs in which each row was flagged
which(s1 > 0)
# [1] 70 74 96 122
s1[which(s1 > 0)]
# 10 5 10 10
# plot codes: 1 = never flagged, 2 = flagged at least once, 3 = flagged in 5 of 10 runs
fg1 <- rep(1, nrow(wine))
fg1[which(s1 > 0)] <- 2
fg1[which(s1 == 5)] <- 3
parcoord(wine[,-1], col = fg1, lty = c(3,1,1)[fg1], lwd = c(1,2,2)[fg1])
n <- nrow(wine); d <- ncol(wine) - 1
# Q-Q plots of mah against chi-squared quantiles
par(mfrow = c(2,1))
qqplot(qchisq(ppoints(n), df = d), ot1$mah, pch = 19, col = fg1[order(ot1$mah)], main = "Q-Q plot No.1")
abline(0, 1, col = "green")
qqplot(qchisq(ppoints(n), df = d), ot2$mah, pch = 19, col = fg1[order(ot2$mah)], main = "Q-Q plot No.2")
abline(0, 1, col = "green")
[Figure (1): parallel coordinate plot of V2 to V14.]
[Figure (2): Q-Q plot No.1 and Q-Q plot No.2, ot1$mah and ot2$mah against qchisq(ppoints(n), df = d).]
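MSD
Wada, K. (2010). Detection of multivariate outliers: the MSD method and its improved versions (in Japanese). 統計研究彙報 (Research Memoir of Official Statistics), Statistical Research and Training Institute, Ministry of Internal Affairs and Communications, 67, 89-157. Single-core implementation; performance comparison of the Canadian and Euredit versions.
Wada, K., & Tsubaki, H. (2013). Parallel Computation of Modified Stahel-Donoho Estimators for Multivariate Outlier Detection. 2013 International Conference on Cloud Computing and Big Data, Fuzhou, China, pp. 304-311. doi: 10.1109/CLOUDCOMASIA.2013.86. Parallel implementation and tuning.
Wada, K., Kawano, M., & Tsubaki, H. (2020). Comparison of multivariate outlier detection methods for nearly elliptical distributions. Austrian Journal of Statistics, 49(2), 1-17. Comparison of methods for application to statistical surveys (against BACON, Fast-MCD, and NNVE).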
Maronna, R. A., & Yohai, V. J. (1995). The behavior of the Stahel-Donoho robust multivariate estimator. Journal of the American Statistical Association, 90(429), 330-341.
Friedman, J. H., & Tukey, J. W. (1974). A projection pursuit algorithm for exploratory data analysis. IEEE Transactions on Computers, C-23(9), 881-890.
Patak, Z. (1990). Robust principal component analysis via projection pursuit. M.Sc. Thesis, University of British Columbia, Canada.
Thank you for listening. [email protected] https://github.com/kazwd2008/