November 11, 2023
Slide overview
Presentation slides for an invited talk at the Developmental Interpretability Conference, November 2023.
I majored in the theory of Bayesian machine learning, became a data scientist, and then obtained a PhD in my original field; I am now a researcher in industry. These slides largely reuse material from my SlideShare: https://www.slideshare.net/naokihayashi714
2023 Developmental Interpretability Conference
Singular Learning Theory of Interpretable Parameter Restriction
This talk was invited by Prof. Murfet.
Naoki Hayashi
Symbol Notations
• q(x), q(y|x): the true (i.e. data-generating) distribution
• p(x|w), p(y|w, x): statistical model given parameter w
• φ(w): prior distribution
• D_n = {(X_1, Y_1), …, (X_n, Y_n)}: i.i.d. sample (r.v.) from q(y|x)q(x)
• P(X^n|w), P(D_n|w): likelihood
• ψ(w|X^n), ψ(w|D_n): posterior distribution
• Z_n: marginal likelihood (a.k.a. evidence)
• p*(x), p*(y|x): predictive distribution (of the output)
• F_n: free energy (= -log Z_n)
• G_n: generalization error (= KL(q || p*))
Outline
1. Singular Learning Theory
2. Parameter Region Restriction
3. Non-negative and Simplex
4. Concept Bottleneck Models
5. Summary
Outline: 1. Singular Learning Theory
1. Singular Learning Theory
Problem Setting
• The important random variables are the generalization error G_n and the marginal likelihood Z_n.
  ‒ G_n = ∫ q(x) log( q(x) / p*(x) ) dx; in the supervised case, G_n = ∫∫ q(x) q(y|x) log( q(y|x) / p*(y|x) ) dx dy.
    ◼ It measures how different the true distribution and the predictive distribution are, in the sense of generating a new data point.
  ‒ Z_n = ∫ ∏_{i=1}^{n} p(X_i|w) φ(w) dw.
    ◼ It measures how close the true distribution and the model are, in the sense of generating the dataset.
• How do they behave?
1. Singular Learning Theory
Regular Case
• From regular learning theory, if the posterior can be approximated by a normal distribution, the following hold:
  ‒ E[G_n] = d/(2n) + o(1/n),
  ‒ F_n = n S_n + (d/2) log n + O_p(1),
  where d is the dimension of the parameter and S_n is the empirical entropy.
• AIC and BIC are based on regular learning theory.
• How about singular cases? (singular = non-regular)
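The following is a minimal numerical sketch, not taken from the slides, of the regular-case claim E[G_n] = d/(2n) + o(1/n): a d-dimensional Gaussian with unknown mean, known identity covariance, and a normal prior, so that the posterior predictive and hence G_n have closed forms. All names and hyperparameter values are my own illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, tau2 = 3, 100.0                    # parameter dimension, prior variance
mu0 = np.zeros(d)                     # true mean (the truth is realizable)

def generalization_error(n: int) -> float:
    """KL( N(mu0, I) || posterior predictive ) for one simulated dataset."""
    x = rng.normal(mu0, 1.0, size=(n, d))
    xbar = x.mean(axis=0)
    mu_n = (n * tau2 / (n * tau2 + 1.0)) * xbar   # posterior mean
    s2_n = tau2 / (n * tau2 + 1.0)                # posterior variance per coordinate
    v = 1.0 + s2_n                                # predictive variance per coordinate
    diff = mu0 - mu_n
    return 0.5 * (d * np.log(v) + (d + diff @ diff) / v - d)

for n in [100, 1000, 10000]:
    g = np.mean([generalization_error(n) for _ in range(2000)])
    print(f"n={n:6d}  n*E[G_n] ~ {n * g:.3f}  (d/2 = {d / 2})")
```

As n grows, n E[G_n] should settle near d/2 = 1.5, which is exactly the regular asymptotics stated on this slide.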
1. Singular Learning Theory
Singular Case
• Hierarchical models and latent variable models are typical singular models.
• Their likelihood and posterior cannot be approximated by any normal distribution.
  ‒ Simple example: the log likelihood is -b²(b - a³)² in the w = (a, b) space.
• Regular learning theory cannot clarify the behavior of their generalization errors and marginal likelihoods.
1. Singular Learning Theory
Singular Case
• Singular learning theory provides a general theory for the above issue.
• Under some technical assumptions, the following hold even if the posterior cannot be approximated by any normal distribution [Watanabe, 2001]:
  ‒ E[G_n] = λ/n - (m - 1)/(n log n) + o(1/(n log n)),
  ‒ F_n = n S_n + λ log n - (m - 1) log log n + O_p(1),
  where λ and m are constants which depend on p(x|w), φ(w), and q(x).
• What are these constants?
1. Singular Learning Theory
Real Log Canonical Threshold
• Def. A real log canonical threshold (RLCT) λ is defined by the maximum pole z = -λ of the zeta function
  ζ(z) = ∫_W K(w)^z b(w) dw,
  ‒ where K(w) and b(w) are non-negative and analytic.
• Thm. Put K(w) = KL(q||p) and b(w) = φ(w); then the RLCT is the learning coefficient λ and the order of the maximum pole is the multiplicity m [Watanabe, 2001].
• This is an important result in singular learning theory.
1. Singular Learning Theory
Real Log Canonical Threshold
[Figure: the lines are the set K(w) = 0 in the parameter space and the star is the "deepest" singularity. It corresponds to the maximum pole of the zeta function, ζ(z) = C/(z + λ)^m + ⋯, marked at z = -λ on the real axis of ℂ.]
1. Singular Learning Theory
Real Log Canonical Threshold
• Intuitive meaning of the RLCT: an intrinsic dimension
  λ = lim_{t→+0} log V(t) / log t,  V(t) = ∫_{K(w)<t} φ(w) dw.
  ‒ A volume dimension of the neighborhood of {K(w) = KL(q||p) = 0}; it is rational.
  [Figure: image of {K(w) < t} as t → +0. Black +: K(w) = 0. Red //: integral range of V(t).]
• Similar to the Minkowski dimension d*:
  d* = d - lim_{t→+0} log 𝒱(t) / log t,  𝒱(t) = ∫_{dist(S,w)<t} dw.
  ‒ A fractal dimension of S ⊂ ℝ^d; it can be irrational.
• Cf. The RLCT dominates the Bayesian generalization error (singular learning theory); the Minkowski dimension dominates the approximation/generalization error of some DNNs (Nakada et al., 2020).
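As a rough Monte Carlo sketch of this volume-dimension reading (not from the slides), one can estimate V(t) for the toy singular function K(a, b) = a²b² on [-1, 1]² with a uniform prior; its RLCT is λ = 1/2 with multiplicity m = 2, so log V(t)/log t should drift toward 0.5, slowly, because of the (log 1/t)^{m-1} factor in V(t). The sample sizes and thresholds are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.uniform(-1.0, 1.0, size=(2_000_000, 2))   # samples from the uniform prior
K = (w[:, 0] ** 2) * (w[:, 1] ** 2)               # K(a, b) = a^2 b^2, zero on the axes

for t in [1e-2, 1e-4, 1e-6, 1e-8]:
    V = np.mean(K < t)                            # prior volume of {K < t}
    print(f"t={t:.0e}  V(t)={V:.3e}  log V / log t = {np.log(V) / np.log(t):.3f}")
```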
1. Singular Learning Theory
Real Log Canonical Threshold
• Properties of λ and m are as follows:
  ‒ λ is a positive rational number and m is a positive integer.
  ‒ They are birational invariants.
    ◼ We can determine them using blowing-ups, mathematically supported by Hironaka's Singularity Resolution Theorem [Hironaka, 1964].
• Applications of λ and m are as follows:
  ‒ Nagata showed that an exchange probability in replica MCMC is represented by using λ [Nagata, 2008].
  ‒ Drton proposed sBIC, which approximates log Z_n by using the RLCTs and the multiplicities of the candidate models p_i with prior φ when the true is q = p_j [Drton, 2017]:
    sBIC_ij "=" log P(D_n | ŵ_MLE) - λ_ij log n + (m_ij - 1) log log n.
    (A small score-comparison sketch follows below.)
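Below is a minimal sketch, not from the slides, of the penalty comparison behind the sBIC display above: the RLCT-based penalty λ log n - (m - 1) log log n versus the regular BIC penalty (d/2) log n. Note that the full sBIC of Drton and Plummer also handles the dependence of (λ, m) on the unknown true model; this only implements the displayed approximation with fixed (λ_ij, m_ij), and all numbers are hypothetical.

```python
import math

def singular_score(loglik_mle: float, lam: float, m: int, n: int) -> float:
    """log P(D_n | w_MLE) - lambda * log n + (m - 1) * log log n."""
    return loglik_mle - lam * math.log(n) + (m - 1) * math.log(math.log(n))

def bic_score(loglik_mle: float, d: int, n: int) -> float:
    """Regular BIC-style score: log P(D_n | w_MLE) - (d / 2) * log n."""
    return loglik_mle - 0.5 * d * math.log(n)

# Toy, hypothetical numbers: a singular model often has lambda << d / 2,
# so its evidence is penalized much less than the naive parameter count suggests.
n, loglik = 10_000, -12_345.6
print(singular_score(loglik, lam=1.5, m=2, n=n))
print(bic_score(loglik, d=10, n=n))
```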
Outline: 2. Parameter Region Restriction
2. Parameter Region Restriction
Motivation
• Parameter regions are often restricted for the sake of interpretability [14] (Kohjima, 2016):
  1. Non-negative restriction
  2. Simplex restriction, etc.
• E.g. logistic regression of purchase existence for a product.
  [Figure: coefficients for the explanatory variables (TVCM, DM, Rating, #Reviews) before and after the non-negative restriction.]
• What happens when restrictions are added? How does the generalization error change?
2. Parameter Region Restriction
Motivation
(Character images: http://watanabe-www.math.dis.titech.ac.jp/users/swatanab/ibis20171110.pdf)
• A future application of clarifying the effect of the restriction on generalization is as follows.
  ‒ Customer: "We want to know what happens when the contributions of the explanatory variables are restricted to be non-negative. We need high prediction accuracy and an interpretable model."
  ‒ Statistician/DS: "If the parameter is restricted to be non-negative, the prediction performance is reduced by Foo points when n = Bar. To achieve the needed accuracy, we recommend increasing n to Bar+Baz."
2. Parameter Region Restriction
Revisiting the Analytic Set
[Figure: the lines are the set K(w) = 0 in the parameter space and the star is the "deepest" singularity, corresponding to the maximum pole of the zeta function ζ(z) = C/(z + λ)^m + ⋯.]
• When a restriction is added, the deepest singularity changes.
• I.e. the RLCT and the multiplicity become different.
• A simple theoretical example is in the Appendix.
2. Parameter Region Restriction
Recent Studies
• We have studied RLCTs of parameter-restricted models:
  ‒ Doctoral thesis: NMF (H. 2017a, 2017b, 2020) and LDA (H. 2021).
  ‒ Recent studies (in industry): two types of CBMs.
Outline: 3. Non-negative and Simplex
3. Non-negative and Simplex Restriction
Non-negative Matrix Factorization
• NMF as a statistical model is formulated as follows; a matrix X is factorized into a product of two matrices [14] (Kohjima, 2016).
  ‒ Data matrices X^n = (X^(1), …, X^(n)), of size M × N × n, are i.i.d. and subject to q(X_ij) = Poi(X_ij | (A_0 B_0)_ij).
    ◼ True factorization: A_0 is M × H_0, B_0 is H_0 × N.
  ‒ Set the model p(X_ij | A, B) = Poi(X_ij | (AB)_ij) and the prior φ(A, B) = ∏ Gam(A_ik | φ_A, θ_A) Gam(B_kj | φ_B, θ_B).
    ◼ Learner factorization: A is M × H, B is H × N.
[Figure: plate diagram A, B → X over n samples, and the M × N matrix X approximated by the product of the M × H matrix A and the H × N matrix B.]
• P(X, A, B) = P(X | A, B) P(A) P(B).
• Poi(x | c) = (c^x e^{-c} / x!) if c > 0, and δ(x) if c = 0.
• Gam(a | φ, θ) = (θ^φ / Γ(φ)) a^{φ-1} e^{-θa}.
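A small generative sketch of this Poisson NMF setup (not from the slides): draw the true factors A_0 and B_0 from the Gamma prior and sample n count matrices X entrywise from Poi((A_0 B_0)_ij). Sizes and hyperparameters are illustrative assumptions, and the log-likelihood helper only covers the (AB)_ij > 0 case.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(2)
M, N, H0, n = 5, 4, 2, 1000
phi_A, theta_A, phi_B, theta_B = 1.0, 1.0, 1.0, 1.0     # Gamma(shape, rate)

A0 = rng.gamma(shape=phi_A, scale=1.0 / theta_A, size=(M, H0))
B0 = rng.gamma(shape=phi_B, scale=1.0 / theta_B, size=(H0, N))
X = rng.poisson(lam=A0 @ B0, size=(n, M, N))             # n i.i.d. M x N count matrices

def log_likelihood(X, A, B):
    """Sum over entries of log Poi(X_ij | (AB)_ij), assuming (AB)_ij > 0."""
    lam = A @ B
    return float(np.sum(X * np.log(lam) - lam - gammaln(X + 1)))

print(X.shape, log_likelihood(X, A0, B0))
```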
3. Non-negative and Simplex Restriction
RLCT of NMF
• The RLCT λ of NMF satisfies the following inequality [H. and Watanabe 2017-a; H. 2020]:
  λ ≤ (1/2) { (H - H_0) min(M φ_A, N φ_B) + H_0 (M + N - 1) }.
  The equality holds if H = H_0 = 1 or H_0 = 0. (A direct transcription in code follows below.)
• A tighter upper bound is also derived if φ_A = φ_B = 1 [H. and Watanabe 2017-b].
• This result gives a lower bound of the variational approximation error in NMF [Kohjima, 2017; H. 2020].
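The bound above is easy to tabulate; the following sketch (not from the slides) transcribes it directly and compares it with half the naive parameter count H(M + N)/2. The example sizes are arbitrary.

```python
def nmf_rlct_upper_bound(M: int, N: int, H: int, H0: int,
                         phi_A: float = 1.0, phi_B: float = 1.0) -> float:
    """Upper bound on the RLCT of Poisson NMF (H. & Watanabe 2017-a; H. 2020)."""
    return 0.5 * ((H - H0) * min(M * phi_A, N * phi_B) + H0 * (M + N - 1))

M, N, H, H0 = 10, 8, 5, 2
print(nmf_rlct_upper_bound(M, N, H, H0), H * (M + N) / 2)   # bound vs d/2
```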
3. Non-negative and Simplex Restriction
Latent Dirichlet Allocation
• Situation:
  ‒ LDA treats a bag of words.
  ‒ Each document (= word list) has some latent topics.
• E.g. a mathematics paper is a document. It is expected that there exist a "NAME" topic, a "MATH" topic, etc. In the "NAME" topic, the appearance probability of mathematicians' names may be high.
  ‒ NAME topic: Riemann, Lebesgue, Atiyah, Hironaka, … (words)
  ‒ MATH topic: integral, measure, distribution, singularity, … (words)
3. Non-negative and Simplex Restriction
Latent Dirichlet Allocation
• Documents z^n and words x^n are observable; topics y^n are not.
• LDA assumes that words occur given documents: x^n ∼ q(x|z).
• We estimate p(x, y | z, w) from the observations.
[Figure: plate diagram document z → topic y → word x, with the NAME and MATH topic examples above.]
3. Non-negative and Simplex Restriction
Latent Dirichlet Allocation
• Data (= word) generating process of LDA [13] (H. and Watanabe 2020-b):
[Figure: documents 1, …, N, each a list of (topic, word) pairs such as (NAME, Alice), (FOOD, sushi), (FOOD, pudding), (NAME, Riemann), (MATH, integral), (NAME, Lebesgue); topics 1, …, H.]
  ‒ A topic proportion b_j = (b_{1j}, …, b_{Hj}) corresponds to each document.
  ‒ A word appearance probability a_k = (a_{1k}, …, a_{Mk}) corresponds to each topic.
3. Non-negative and Simplex Restriction
Latent Dirichlet Allocation
• LDA is formalized as follows (a generative sketch in code follows below).
  ‒ One-hot encoded words x^n = (x^(1), …, x^(n)), of size M × n, and documents z^n = (z^(1), …, z^(n)), of size N × n, are i.i.d. and generated from q(x|z) q(z).
  ‒ Set the model
    p(x, y | z, A, B) = ∏_{j=1}^{N} ∏_{k=1}^{H} ( b_{kj} ∏_{i=1}^{M} a_{ik}^{x_i} )^{y_k z_j}
    and the prior φ(A, B) = ∏_k Dir(a_k | φ_A) ∏_j Dir(b_j | φ_B).
    ◼ Latent topics: y^n = (y^(1), …, y^(n)), of size H × n.
    ◼ Stochastic matrices: A is M × H, B is H × N, with Σ_i a_{ik} = 1 and Σ_k b_{kj} = 1.
• P(X, Y, A, B | Z) = P(X, Y | Z, A, B) P(A) P(B); Dir(c | φ) = ( Γ(Σ_{k=1}^{H} φ_k) / ∏_{k=1}^{H} Γ(φ_k) ) ∏_{k=1}^{H} c_k^{φ_k - 1}, with Σ_k c_k = 1.
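The following is a generative sketch of this data process, not from the slides: per document j a topic proportion b_j on the H-simplex, per topic k a word distribution a_k on the M-simplex, and each observed word drawn by sampling a topic from b_j and then a word from that topic's distribution. Sizes and Dirichlet hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
M, N, H, words_per_doc = 50, 20, 3, 100          # vocabulary, documents, topics, length

A = rng.dirichlet(alpha=np.ones(M), size=H).T    # M x H, columns a_k sum to 1
B = rng.dirichlet(alpha=np.ones(H), size=N).T    # H x N, columns b_j sum to 1

corpus = []
for j in range(N):                               # document index (one-hot z picks column j)
    topics = rng.choice(H, size=words_per_doc, p=B[:, j])   # latent topics y
    words = [rng.choice(M, p=A[:, k]) for k in topics]      # observed words x
    corpus.append(words)

print(len(corpus), corpus[0][:10])
```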
3. Non-negative and Simplex Restriction
Stochastic Matrix Factorization
• In the NMF, consider formally replacing the non-negative parameter matrices with stochastic matrices.
  ‒ A stochastic matrix is a non-negative matrix in which the entries of each column sum to 1. Example:
    0.1 0.1 0.4 0
    0.5 0.1 0.4 0
    0.4 0.8 0.2 1
• This replaced model is called stochastic matrix factorization (SMF).
3. Non-negative and Simplex Restriction
Equivalence of LDA and SMF
• Let K(w) = Σ_z Σ_x q(x|z) q(z) log( q(x|z) / p(x|z, A, B) ) and H(w) = ||AB - A_0 B_0||², where q(x|z) = Σ_y p(x, y | z, A_0, B_0).
• It can be proved that K(w) ∼ H(w), i.e. the RLCT of LDA is equal to that of SMF.
• Thus, we only have to consider SMF to determine λ and m of LDA.
3. Non-negative and Simplex Restriction
RLCT of LDA
• Let λ_LDA be the RLCT of LDA [H. 2021]:
  λ_LDA(M, N, H, r) = λ_RRR(M - 1, N - 1, H - 1, r - 1) + (M - 1)/2 … (1)
                    = λ_RRR(M, N, H, r) - N/2 … (2)
  ‒ where r is the rank of an intrinsic matrix product of the true parameter and λ_RRR is the RLCT of reduced rank regression. (A computational sketch follows below.)
• The simplex restriction affects the generalization error.
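To make relation (1) computable, here is a sketch not taken from the slides: lambda_lda implements equation (1), and the helper lambda_rrr implements the case formula for the RLCT of reduced rank regression as I recall it from Aoyagi and Watanabe (2005); please verify the cases against the paper before relying on it. The final loop illustrates, for arbitrary sizes, the behavior plotted on the next slide: λ_LDA grows sub-linearly in H while d/2 grows linearly.

```python
def lambda_rrr(M: int, N: int, H: int, r: int) -> float:
    """RLCT of reduced rank regression (output M, input N, rank H, true rank r)."""
    if M + N < H + r:
        return M * N / 2
    if M + H < N + r:
        return (H * M - H * r + N * r) / 2
    if N + H < M + r:
        return (H * N - H * r + M * r) / 2
    num = 2 * (H + r) * (M + N) - (M - N) ** 2 - (H + r) ** 2
    return (num + ((M + N + H + r) % 2)) / 8      # +1 in the odd-sum case

def lambda_lda(M: int, N: int, H: int, r: int) -> float:
    """RLCT of LDA via relation (1) on this slide (H. 2021)."""
    return lambda_rrr(M - 1, N - 1, H - 1, r - 1) + (M - 1) / 2

# Illustration of the plot on the next slide: lambda_LDA grows sub-linearly in
# the number of topics H, while half the parameter count grows linearly.
M, N, r = 20, 10, 2
for H in range(3, 12, 2):
    d_half = ((M - 1) * H + (H - 1) * N) / 2
    print(H, lambda_lda(M, N, H, r), d_half)
```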
3. Non-negative and Simplex Restriction
RLCT of LDA [H. 2021]
[Figure: RLCT, i.e. lim_{n→∞} n E[G_n], as the number of topics H changes.]
• (Parameter dimension)/2, d/2 = ((M - 1)H + (H - 1)N)/2 (yellow ◆): increases linearly and is unbounded.
• RLCT of LDA (blue ●): increases non-linearly and is upper bounded.
Outline: 4. Concept Bottleneck Models
4. Concept Bottleneck Models
Concept-based XAI
• Many learning machines (such as DNNs) are black boxes.
• Explainable AI (XAI): making them explainable/interpretable. How?
• E.g. concept-based XAI:
  ‒ Concept: explains the reasons for the output (for the domain).
  ‒ Ante-hoc: concepts affect the learning process (not only test time).
4. Concept Bottleneck Models
Concept Bottleneck Model (CBM) [Koh, 2020]
• Requires: input x, output y, concept c.
• Architecture: x → B → c → A → y.
• Loss: ||y - A∘B(x)||² + γ ||c - B(x)||². (A toy implementation sketch follows below.)
[Figure: bird-classification example. Concept labels c (black feather: true, black beak: true, …), coefficients A (black feather: 42, black beak: 7.2, …), and predicted probabilities y over birds (crow: 0.8, pigeon: 0.1, …).]
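A linear toy sketch (not from the slides) of the CBM objective just shown: predict concepts c_hat = B x, predict the output from the concepts y_hat = A c_hat, and penalize both the output error and the concept error with a trade-off weight gamma. All shapes and data here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N_in, K, M, n = 8, 4, 3, 200            # input dim, #concepts, output dim, sample size
x = rng.normal(size=(n, N_in))
B0 = rng.normal(size=(K, N_in))         # "true" concept map (assumption)
A0 = rng.normal(size=(M, K))            # "true" output map (assumption)
c = x @ B0.T + 0.1 * rng.normal(size=(n, K))
y = c @ A0.T + 0.1 * rng.normal(size=(n, M))

def cbm_loss(A, B, x, y, c, gamma=1.0):
    """|| y - A B x ||^2 + gamma * || c - B x ||^2, averaged over the sample."""
    c_hat = x @ B.T
    y_hat = c_hat @ A.T
    return np.mean(np.sum((y - y_hat) ** 2, axis=1)
                   + gamma * np.sum((c - c_hat) ** 2, axis=1))

print(cbm_loss(A0, B0, x, y, c))
```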
4. Concept Bottleneck Models
Multitask Formulation [Xu, 2020]
• Requires: input, output, concept.
• The concept labels and the output are vertically concatenated into one target vector.
[Figure: the same bird example; concept labels (black feather: true, black beak: true, …) stacked with the predicted class probabilities (crow: 0.8, pigeon: 0.1, …).]
4. Concept Bottleneck Models
CBM and Multitask
【Both】
• Concept-based XAI.
• Use concepts in the training process.
【Difference】
• CBM: concepts are inserted at the last connection.
• Multitask: concepts are concatenated to the output.
4. Concept Bottleneck Models
CBM and Multitask
[Figure: the probabilistic model of CBM and that of Multitask, each showing the true distribution and the inference model.]
4. Concept Bottleneck Models
Three-layered and Linear CBM and Multitask
• Assumptions:
  ‒ Three-layered and linear (the same as reduced rank regression).
    ◼ Cf. SoTA trained NN → three-layered distilled NN (transfer learning).
  ‒ Real-valued inputs, outputs, and concepts with Gaussian noise.
    ◼ For categorical outputs or binary concepts, see the complete version: https://arxiv.org/abs/2303.09154
4. Concept Bottleneck Models
Main Theorem (CBM)
• The three-layered and linear CBM is regular, so its RLCT is obtained by parameter counting: λ_1 = K(M + N)/2,
  ‒ where K = concept dimension, M = output dimension, N = input dimension.
4. Concept Bottleneck Models
CBM vs Multitask
• RLCT of Multitask: that of reduced rank regression (λ_RRR),
  ‒ in the case when the outputs are categorical, adjust the dimension,
  ‒ where H = middle-layer dimension, K = concept dimension, M = output dimension, N = input dimension, and r = rank(U_0 V_0) (true rank).
• Compare the generalization errors: solve λ_1 - λ_2 > 0. (A numeric sketch follows below.)
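A comparison sketch, not from the slides: λ_1 = K(M + N)/2 for the regular linear CBM versus the multitask RLCT, which the slide identifies with that of reduced rank regression. Treating the concatenated target [y; c] as an (M + K)-dimensional output is my assumption here, and lambda_rrr is the helper defined in the earlier sketch for the LDA slide.

```python
def lambda_cbm(M: int, N: int, K: int) -> float:
    """RLCT of the regular three-layered linear CBM: half the parameter count."""
    return K * (M + N) / 2

def lambda_multitask(M: int, N: int, K: int, H: int, r: int) -> float:
    """Assumed: reduced rank regression with concatenated (M + K)-dim output."""
    return lambda_rrr(M + K, N, H, r)

M, N, K = 5, 10, 6
for H, r in [(3, 1), (6, 1), (6, 4), (9, 4)]:
    print(H, r, lambda_cbm(M, N, K), lambda_multitask(M, N, K, H, r))
```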
4. Concept Bottleneck Models
CBM vs Multitask
[Figures: comparison of CBM and Multitask for each true distribution. One set of panels fixes H and changes K (H=3, H0=1; H=6, H0=1; H=6, H0=4; H=9, H0=1; H=9, H0=4; H=9, H0=7); the other fixes K and changes H (K=3, H0=1; K=6, H0=1; K=6, H0=4; K=9, H0=1; K=9, H0=4; K=9, H0=7).]
4. Concept Bottleneck Models
Partial CBM
• The concept observations are a strong restriction, so CBM has a larger generalization error.
  ‒ Empirically, the accuracy of CBM is less than that of a plain NN and of Multitask.
• Besides, concept labeling is hard if K is large.
  ‒ For bird classification: color of wings, feathers, beaks, … etc.
• Partial CBM (PCBM): partialize the concept layer [Sawada, 2022; Li, 2022].
4. Concept Bottleneck Models
PCBM Architecture [Sawada, 2022; Li, 2022]
• CBM: x → B → c → A → y, with c ∈ ℝ^K.
• PCBM: x → B → (c′, c) → A → y, with c′ ∈ ℝ^{K1} (unsupervised part) and c ∈ ℝ^{K2} (supervised concepts), K1 + K2 = K.
4. Concept Bottleneck Models
Three-layered and Linear PCBM
[Figure: true distribution x → B → c → A → y and inference model x → B → (c′, c) → A → y, with B = [B1; B2] and A = [A1 A2].]
• Assumptions:
  ‒ Three-layered and linear.
  ‒ Real-valued inputs, outputs, and concepts with Gaussian noise (the same as the CBM case above).
4. Concept Bottleneck Models
Main Theorem (PCBM)
• The RLCT λ_P of the three-layered and linear PCBM satisfies bounds including the following upper bound (used on the next slide):
  λ_P ≤ λ_RRR(M, N, K1, r′) + K2(M + N)/2,
  ‒ where r′ = rank(A_10 B_10).
[Figure: x → B → (c′, c) → A → y.]
4. Concept Bottleneck Models
PCBM vs CBM
• GE of PCBM ≤ the upper bound (UB) ≤ GE of CBM:
  ‒ UB = λ_RRR(M, N, K1, r′) + K2(M + N)/2,
  ‒ λ_RRR(M, N, K1, r′) ≤ K1(M + N)/2 (RRR is singular),
  ‒ UB ≤ K1(M + N)/2 + K2(M + N)/2 = K(M + N)/2 = λ_1,
  ‒ ∴ λ_P ≤ UB ≤ K(M + N)/2 = λ_1.
• The generalization error decreases because part of the concepts are replaced by unsupervised ones. (The paper on this theorem is in progress; a numeric check of the bound follows below.)
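A numeric sketch, not from the slides, of the chain of bounds above, reusing the lambda_rrr helper from the earlier sketch: it checks that the PCBM upper bound λ_RRR(M, N, K1, r′) + K2(M + N)/2 never exceeds the CBM value λ_1 = K(M + N)/2 for a few illustrative sizes.

```python
def pcbm_upper_bound(M: int, N: int, K1: int, K2: int, r_prime: int) -> float:
    """Upper bound on the PCBM RLCT from this slide."""
    return lambda_rrr(M, N, K1, r_prime) + K2 * (M + N) / 2

M, N = 5, 10
for K1, K2, r_prime in [(2, 4, 1), (4, 2, 1), (4, 2, 3), (6, 0, 2)]:
    K = K1 + K2
    lam1 = K * (M + N) / 2
    ub = pcbm_upper_bound(M, N, K1, K2, r_prime)
    print(f"K1={K1} K2={K2} r'={r_prime}: UB={ub:.2f} <= lambda_1={lam1:.2f}", ub <= lam1)
```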
Outline: 5. Summary
5. Summary
• Structure that makes a model interpretable is often represented as a parameter restriction.
• The CBM's concept observations are a strong restriction on the NN's weights; they completely separate the weights in the parameter region.
• The PCBM's restriction relaxes this separation of the weights and improves the generalization performance compared with CBM.
• Future direction: clarifying RLCTs of the other types of concept-based XAI from the point of view of singular learning theory.