Specially Appointed Assistant Professor, Graduate School of Medicine, Osaka University
Privacy-Preserving Deep Learning
R. Shokri and V. Shmatikov. Privacy-Preserving Deep Learning. In CCS, pages 1310–1321. ACM, 2015.
Abstract
• Deep learning achieves very high accuracy on complex tasks such as image, speech, and text recognition.
• Because the amount of training data directly determines deep learning accuracy, many companies collect user data at massive scale.
• Such large-scale collection includes users' privacy-sensitive data, and handling it properly is a challenge.
• Medical data held by hospitals, for example, is difficult to manage centrally because of privacy protection.
• This paper proposes a method for building an accurate neural network without sharing the input data held by multiple parties.
Introduction
• Advances in deep learning have produced breakthroughs in long-standing tasks such as speech, image, and text recognition and language translation.
• Companies such as Google, Facebook, and Apple collect huge amounts of training data from their users, which gives them a strong advantage.
• Such large-scale data collection includes users' privacy-sensitive data, and handling it properly is a challenge.
Introduction
• Collecting training data for deep learning and protecting privacy are often in conflict, and the conflict is hard to resolve.
• The same is true in medicine, where sharing individuals' data is not permitted by law or regulation.
• As a result, biomedical and clinical research can train only on the datasets held by each institution.
• Data from a single site causes overfitting, and accuracy drops when data from other sites is given as input.
Introduction
• Recent advances in deep learning methods based on artificial neural networks have led to breakthroughs in long-standing AI tasks such as speech, image, and text recognition, language translation, etc.
• Companies such as Google, Facebook, and Apple take advantage of the massive amounts of training data collected from their users.
• Massive data collection required for deep learning presents obvious privacy issues. Users' personal, highly sensitive data such as photos and voice recordings is kept indefinitely by the companies that collect it.
Motivation
• Implement and evaluate a practical system that lets multiple parties jointly learn an accurate neural network without sharing their input data.
• Deep learning based on stochastic gradient descent can be parallelized and executed asynchronously.
• The system lets each participant train independently on its own dataset and selectively share key model parameters during training.
• Participants protect the privacy of their data while still benefiting from the other participants' models, so learning accuracy improves.
• The heart of the system is the "selective sharing of model parameters" during training.
Introduction
• In some domains, most notably those related to medicine, the sharing of data about individuals is not permitted by law or regulation.
• Consequently, biomedical and clinical researchers can only perform deep learning on the datasets belonging to their own institutions.
• It is well-known that neural-network models become better as the training datasets grow bigger and more diverse. Data owned by a single organization may be very homogeneous, producing an overfitted model that will be inaccurate when used on other inputs.
Motivation
• Implement and evaluate a practical system that enables multiple parties to jointly learn an accurate neural-network model for a given objective without sharing their input datasets.
• Deep learning based on stochastic gradient descent can be parallelized and executed asynchronously.
• The system enables participants to train independently on their own datasets and to selectively share small subsets of their models' key parameters during training.
• Participants preserve the privacy of their respective data while still benefitting from other participants' models, thus boosting their learning accuracy.
Motivation
• Our key technical innovation is the selective sharing of model parameters during training.
• We experimentally evaluate our system on two datasets, MNIST and SVHN.
• The accuracy of the models produced by the distributed participants in our system is close to the centralized, privacy-violating case where a single party holds the entire dataset and uses it to train the model.
Stochastic gradient descent (SGD)
• Stochastic gradient descent (SGD) is a drastic simplification which computes the gradient over an extremely small subset (mini-batch) of the whole dataset.
• Let $\mathbf{w}$ be the flattened vector of all parameters in a neural network, composed of the layer weight matrices $W_k$. Let $E$ be the error function, i.e., the difference between the true value of the objective function and the computed output of the network.
• The back-propagation algorithm computes the partial derivative of $E$ with respect to each parameter in $\mathbf{w}$ and updates the parameter so as to reduce its gradient. The update rule of stochastic gradient descent for a parameter $w_j$ is
$$w_j \leftarrow w_j - \alpha \frac{\partial E_i}{\partial w_j} \quad (1)$$
where $\alpha$ is the learning rate and $E_i$ is computed over the mini-batch $i$.
Stochastic gradient descent (SGD)
• Stochastic gradient descent (SGD) is a simplified method that computes the gradient over only a very small subset (mini-batch) of the whole dataset.
• Let $W_k$ be the weight matrix of the $k$-th layer of the neural network, $\mathbf{w}$ the weight vector of an arbitrary neuron in it, and $w_j$ the $j$-th weight of $\mathbf{w}$.
• With $E$ an arbitrary loss function, $w_j$ is updated by
$$w_j \leftarrow w_j - \alpha \frac{\partial E_i}{\partial w_j} \quad (1)$$
where $\alpha$ is the learning rate and $E_i$ is recomputed for each mini-batch $i$.
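To make update rule (1) concrete, here is a minimal NumPy sketch of one SGD step on a mini-batch; `sgd_step` and the toy data are illustrative assumptions, not the paper's code.

```python
import numpy as np

def sgd_step(w, grad_E_i, alpha=0.1):
    """Apply update rule (1): w_j <- w_j - alpha * dE_i/dw_j for every parameter."""
    return w - alpha * grad_E_i

# Toy usage: a flattened parameter vector and a gradient computed on one
# mini-batch i (random numbers stand in for the real back-propagated gradient).
w = np.random.randn(10)
grad_E_i = np.random.randn(10)
w = sgd_step(w, grad_E_i, alpha=0.1)
```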
Distributed Selective SGD
• The core of our approach rests on the following observations:
1. Updates to different parameters during gradient descent are inherently independent.
2. Different training datasets contribute to different parameters.
3. Different features do not contribute equally to the objective function.
Selective parameter update
• The point of selective SGD is to update the weights selectively.
• The selection criterion is to pick the weights that are far from their local optima (this makes learning efficient).
• "Weights far from their local optima" = "weights with large gradients".
• SGD already computes $\frac{\partial E_i}{\partial w_j}$ (the gradient of $w_j$), and this value is used for the selection.
• Among the gradients $\frac{\partial E_i}{\partial w_j}$ of the weight vector $\mathbf{w}$, select the $\theta$ weights $\mathbf{w}_S$ with the largest gradients.
• Update $\mathbf{w}_S$ as usual ($\mathbf{w}_{\bar{S}}$ is not updated).
• $\theta$ is a hyperparameter: the fraction of all weights selected for updating.
Selective parameter update
• Some parameters contribute much more to the neural network's objective function and thus undergo much bigger updates during a given iteration of training.
• The gradient value depends on the training sample (mini-batch) and varies from one sample to another.
• Moreover, some features of the input data are more important than others, and the parameters that help compute these features are more crucial in the process of learning and undergo bigger changes.
Selective parameter update
• In selective SGD, the learner chooses a fraction of parameters to be updated at each iteration.
• A smart strategy is to select the parameters whose current values are farther away from their local optima, i.e., those that have a larger gradient.
• For each training sample $i$, compute the partial derivative $\frac{\partial E_i}{\partial w_j}$ for all parameters $w_j$, as in SGD.
• Let $S$ be the indices of the $\theta$ parameters with the largest $\frac{\partial E_i}{\partial w_j}$ values.
• Finally, update the parameter vector $\mathbf{w}_S$ in the same way, so the parameters not in $S$ remain unchanged.
• We refer to the ratio of $\theta$ over the total number of parameters as the parameter selection rate.
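The following is a minimal sketch of one selective-SGD step, assuming the selection ranks gradients by absolute value; the name `selective_sgd_step` and the toy data are my own, not from the paper's implementation.

```python
import numpy as np

def selective_sgd_step(w, grad, theta, alpha=0.1):
    """Update only the theta parameters with the largest gradients.

    The ranking uses |gradient| (one reasonable reading of "largest gradient");
    parameters outside the selected index set S remain unchanged.
    """
    theta = max(1, int(theta))                # update at least one parameter
    S = np.argsort(np.abs(grad))[-theta:]     # indices of the theta largest gradients
    w = w.copy()
    w[S] -= alpha * grad[S]                   # update rule (1), restricted to S
    return w, S

# Toy usage: a 1% parameter selection rate on a random parameter vector.
w = np.random.randn(1000)
grad = np.random.randn(1000)                  # stands in for dE_i/dw
theta = int(0.01 * w.size)                    # selection rate 0.01 -> theta parameters
w, S = selective_sgd_step(w, grad, theta)
```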
Distributed collaborative learning
• Distributed selective SGD assumes two or more participants training independently and concurrently.
• After each round of local training, participants asynchronously share with each other the gradients they computed for some of the parameters.
• Each participant fully controls which gradients to share and how often.
• The sum of all gradients computed for a given parameter determines the magnitude of the global descent towards the parameter's local optima.
System Architecture
• Each participant initializes the parameters and then runs the training on his own dataset.
• Participants upload the gradients of selected neural-network parameters to the parameter server and download the latest parameter values at each local SGD epoch.
Local training
1. The participant downloads a $\theta_d$ fraction of parameters from the server and overwrites his local parameters.
2. The participant runs one epoch of SGD training on his local dataset.
3. The participant computes $\Delta \mathbf{w}^{(i)}$, the vector of changes in all parameters in step 2.
(Update rule (1): $w_j \leftarrow w_j - \alpha \frac{\partial E_i}{\partial w_j}$)
Local training
4. The participant uploads $\Delta \mathbf{w}_S^{(i)}$, where the set $S$ is selected by one of the following criteria:
• select exactly a $\theta_u$ fraction of values, picking the big values that contribute most to the gradient descent algorithm;
• select a random subset of values that are larger than a threshold $\tau$.
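To make step 4's two upload criteria concrete, here is a small NumPy sketch; the function names (`select_largest_fraction`, `select_random_over_threshold`) and the toy data are my own assumptions, not the paper's implementation.

```python
import numpy as np

def select_largest_fraction(delta_w, theta_u):
    """Criterion 1: upload exactly a theta_u fraction of the gradients,
    keeping the entries with the largest magnitude."""
    k = max(1, int(theta_u * delta_w.size))
    idx = np.argsort(np.abs(delta_w))[-k:]
    return idx, delta_w[idx]

def select_random_over_threshold(delta_w, tau, theta_u, rng=None):
    """Criterion 2: among gradients whose magnitude exceeds tau, upload a
    random subset of at most a theta_u fraction of all parameters
    (possibly fewer, if few gradients exceed tau)."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = np.flatnonzero(np.abs(delta_w) > tau)
    if candidates.size == 0:
        return candidates, delta_w[candidates]
    k = min(candidates.size, max(1, int(theta_u * delta_w.size)))
    idx = rng.choice(candidates, size=k, replace=False)
    return idx, delta_w[idx]

# Toy usage on a gradient vector produced in step 3.
delta_w = np.random.randn(1000)
idx1, vals1 = select_largest_fraction(delta_w, theta_u=0.1)
idx2, vals2 = select_random_over_threshold(delta_w, tau=1.0, theta_u=0.1)
```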
Parameter server
• The parameter server initializes the parameter vector $\mathbf{w}^{(global)}$ and then handles the participants' upload and download requests.
• When someone uploads gradients, the server adds each uploaded $\Delta w_j$ value to the corresponding global parameter.
• Participants obtain from the server the latest values of the parameters.
• Each participant decides what fraction of these parameters to download by setting $\theta_d$.
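A minimal in-memory sketch of the parameter server's role as described above; the class `ParameterServer`, its methods, and the random choice of which parameters to serve on download are assumptions for illustration, not the paper's code.

```python
import numpy as np

class ParameterServer:
    """Minimal in-memory sketch of the parameter server described above."""

    def __init__(self, num_params):
        # Initialize the global parameter vector w^(global).
        self.w_global = 0.01 * np.random.randn(num_params)

    def upload(self, indices, delta_w):
        # Add each uploaded Delta w_j value to the corresponding global parameter.
        self.w_global[indices] += delta_w

    def download(self, theta_d, rng=None):
        # Return a theta_d fraction of the latest parameter values; which
        # parameters to fetch is chosen at random here purely for illustration.
        rng = np.random.default_rng() if rng is None else rng
        k = max(1, int(theta_d * self.w_global.size))
        idx = rng.choice(self.w_global.size, size=k, replace=False)
        return idx, self.w_global[idx]

# Toy round trip for one participant: download, local training (not shown),
# then upload the selected parameter changes.
server = ParameterServer(num_params=1000)
idx, values = server.download(theta_d=0.1)
server.upload(idx, 0.001 * np.random.randn(idx.size))  # stands in for selected Delta w
```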
Evaluation
(Figure: convergence of SSGD for different mini-batch sizes)
• These results confirm the intuition behind SSGD: by sharing only a small fraction of gradients at each gradient-descent step, we can achieve almost the same accuracy as SGD.
Evaluation
(Figure: accuracy of DSSGD for different gradient selection criteria)
• Under the threshold-based criterion, fewer than a $\theta_u$ fraction of the gradients may be uploaded, so accuracy is sometimes lower.
Conclusions
• We proposed a new distributed training technique based on selective stochastic gradient descent.
• It can help bring the benefits of deep learning to domains where data owners are precluded from sharing their data by confidentiality concerns.