[Diffusion Model Study Group] Adding Conditional Control to Text-to-Image Diffusion Models

May 14, 24

#ControlNet #text-to-image generation #diffusion models #conditional control #Stable Diffusion

Deep Learning JP

Text of each slide

DEEP LEARNING JP [DL Papers]
Adding Conditional Control to Text-to-Image Diffusion Models [1]
Itsunori Watanabe, Waseda University
http://deeplearning.jp/

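Summary
• Developed ControlNet, which adds control to text-to-image generative models (nominally targeting Stable Diffusion)
• Enables robust conditioning of generation without the overfitting or catastrophic forgetting that has been a problem for fine-tuning
※ The default prompt is "a high-quality, detailed, and professional image"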

Contents
• Background
• Related work
• Method
• Experimental results

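Background
• Conditioning a generative model is generally difficult
  • Insufficient data
    • Compared with the scale of datasets such as LAION-5B used to train Stable Diffusion, a dataset for a specific conditioning task has only about 100K samples (1/50,000 of LAION)
  • Risk of catastrophic forgetting and overfitting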

Existing work related to fine-tuning
• HyperNetwork [2]
• Adapter [3]
• Additive Learning [4]
• LoRA
• Zero-Initialized Layers [5]

Existing diffusion models for image generation
• LDM (Latent Diffusion Models) [2]
• GLIDE [6]
• Disco Diffusion [7]
• Stable Diffusion [8]

Existing methods for controlling image generation models
• Make-A-Scene [9]
• SpaText [10]
• GLIGEN [11]
• Textual Inversion [12]
• DreamBooth [13]

Related work on image-to-image translation
• Palette [14]
• PITI [15]

Key points of the method
• All Stable Diffusion parameters are frozen
• For each Stable Diffusion block, a copy called the trainable copy is created
• The trainable copy and the frozen Stable Diffusion are connected by zero convolutions
  • Zero convolution: a 1×1 convolution initialized to 0
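
A minimal sketch of the zero convolution described above, written in plain Python with nested lists instead of tensors. The helper names (`conv1x1`, `zero_conv_params`) and the toy input are assumptions for illustration, not code from the paper or the slides. It shows that the layer outputs exactly zero at initialization, yet the gradient with respect to its weights is generally nonzero, so the layer can still learn.

```python
def conv1x1(feature_map, weight, bias):
    """Apply a 1x1 convolution to feature_map[channel][row][col].

    weight[out_ch][in_ch] and bias[out_ch] mix channels independently at each pixel.
    """
    in_ch = len(feature_map)
    height, width = len(feature_map[0]), len(feature_map[0][0])
    out = []
    for o in range(len(weight)):
        plane = [
            [bias[o] + sum(weight[o][i] * feature_map[i][r][c] for i in range(in_ch))
             for c in range(width)]
            for r in range(height)
        ]
        out.append(plane)
    return out


def zero_conv_params(in_ch, out_ch):
    """Hypothetical helper: weight and bias both initialized to zero."""
    return [[0.0] * in_ch for _ in range(out_ch)], [0.0] * out_ch


# Toy input: 2 channels, 2x2 spatial size (made-up values).
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[-1.0, 0.5], [0.0, 2.0]]]

weight, bias = zero_conv_params(in_ch=2, out_ch=2)
y = conv1x1(x, weight, bias)  # every entry is 0.0 at initialization

# For the toy loss L = sum(y), dL/dweight[o][i] is the sum of x[i] over all pixels.
# It does not depend on the (zero) weights, so it is nonzero for nonzero inputs.
grad_w = [[sum(sum(row) for row in x[i]) for i in range(2)] for _ in range(2)]

# One plain gradient-descent step moves the weights away from zero.
learning_rate = 0.1
updated_weight = [[w - learning_rate * g for w, g in zip(w_row, g_row)]
                  for w_row, g_row in zip(weight, grad_w)]
```

After the single update step the weights are no longer zero, which is why a zero-initialized layer does not stay stuck at zero during training.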

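Basic structure of ControlNet
• The condition c is fed into the trainable copy, and the result is returned to the original model through zero convolutions

Symbols:
• y_c: output of ControlNet
• F: NN block
• x: input
• θ: parameters of the locked NN
• Z: zero convolution
• c: condition
• θ_z1: parameters of the first zero convolution
• θ_z2: parameters of the last zero convolution
• θ_c: parameters of the trainable copy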
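
The equation on this slide was an image and is missing from the extracted text. With the symbols listed on the slide, the corresponding equation in the ControlNet paper [1] can be written as follows (reproduced from the paper, using the slide's lowercase θ):

```latex
y_c = F(x;\theta) + Z\bigl(F\bigl(x + Z(c;\theta_{z1});\,\theta_c\bigr);\,\theta_{z2}\bigr)
```

The first term is the frozen block applied to x. The second term is the trainable copy applied to x plus the zero-convolved condition, passed through the last zero convolution.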

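Zero convolutions suppress noise early in training
• At the start of training, the zero-convolution term equals 0, so nothing is added to the output of the locked model

Symbols:
• y_c: output of ControlNet
• F: NN block
• x: input
• θ: parameters of the locked NN
• Z: zero convolution
• c: condition
• θ_z1: parameters of the first zero convolution
• θ_z2: parameters of the last zero convolution
• θ_c: parameters of the trainable copy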
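
The behaviour at initialization can be checked with a toy sketch. Everything here is an illustrative assumption: `block` is a stand-in for an NN block F (an elementwise affine map followed by ReLU), and `zero_conv` is a single-channel 1×1 convolution reduced to a scale and shift. It is not the Stable Diffusion architecture, only the same wiring as the ControlNet block.

```python
import copy


def block(x, params):
    """Stand-in for an NN block F(x; theta): elementwise affine map + ReLU."""
    return [max(0.0, params["w"] * v + params["b"]) for v in x]


def zero_conv(x, params):
    """Single-channel 1x1 convolution Z(x; theta_z): scale and shift per element."""
    return [params["w"] * v + params["b"] for v in x]


def controlnet_block(x, c, theta, theta_c, theta_z1, theta_z2):
    """Compute y_c = F(x; theta) + Z(F(x + Z(c; theta_z1); theta_c); theta_z2)."""
    locked_out = block(x, theta)                       # frozen path
    cond = zero_conv(c, theta_z1)                      # first zero convolution
    copy_in = [xv + cv for xv, cv in zip(x, cond)]
    copy_out = block(copy_in, theta_c)                 # trainable copy
    residual = zero_conv(copy_out, theta_z2)           # last zero convolution
    return [a + b for a, b in zip(locked_out, residual)]


theta = {"w": 0.8, "b": 0.1}        # locked parameters (made-up values)
theta_c = copy.deepcopy(theta)      # trainable copy starts as a clone
theta_z1 = {"w": 0.0, "b": 0.0}     # zero-initialized
theta_z2 = {"w": 0.0, "b": 0.0}     # zero-initialized

x = [1.0, -2.0, 0.5]
c = [0.3, 0.3, 0.9]

y = block(x, theta)
y_c = controlnet_block(x, c, theta, theta_c, theta_z1, theta_z2)
# At initialization y_c equals y: the condition has no effect yet.

# Once the zero convolutions have moved away from zero, the condition changes the output.
trained_z = {"w": 1.0, "b": 0.0}    # hypothetical post-training values
y_c_trained = controlnet_block(x, c, theta, theta_c, trained_z, trained_z)
```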

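Integration into Stable Diffusion
• The 12 encoding blocks and the 1 middle block are copied
• The conditioning results are added through zero convolutions to the 12 skip connections and the 1 middle block
• Compared with optimizing Stable Diffusion without ControlNet, training needs only about 23% more VRAM and about 34% more time per iteration
• Stable Diffusion preprocesses 512×512 input images by converting them into a 64×64 latent space; the conditioning image is processed in the same way to obtain the condition input c_f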
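
The block counts on this slide can be laid out as a small sketch. The slide gives only the counts (12 encoding blocks, 1 middle block) and the 512 to 64 reduction. The split into four resolutions with three blocks each follows the description in the ControlNet paper [1] and is an assumption as far as the slide text goes. All variable names are illustrative.

```python
IMAGE_SIZE = 512
LATENT_SIZE = 64
downscale = IMAGE_SIZE // LATENT_SIZE  # factor between pixel space and latent space

# Assumption (from the paper, not the slide text): encoder blocks come in
# 4 resolutions, each repeated 3 times, and the middle block runs at the lowest one.
NUM_RESOLUTIONS = 4
BLOCKS_PER_RESOLUTION = 3
encoder_resolutions = [LATENT_SIZE // (2 ** level) for level in range(NUM_RESOLUTIONS)]

copied_blocks = [
    ("encoder", resolution)
    for resolution in encoder_resolutions
    for _ in range(BLOCKS_PER_RESOLUTION)
]
copied_blocks.append(("middle", encoder_resolutions[-1]))

# One zero convolution feeds each injection point:
# the 12 skip connections plus the middle block.
num_injection_points = len(copied_blocks)

for kind, resolution in copied_blocks:
    print(f"{kind:7s} block at {resolution}x{resolution}")
```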

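Prompt dropping and the sudden convergence phenomenon
• Prompt dropping
  • 50% of the prompts are replaced with empty strings
  • This makes it easier for the model to recognize the conditioning directly
• Sudden convergence phenomenon
  • The model does not learn the condition gradually; the ability to follow the condition appears abruptly (usually within 10K optimization steps)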
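
Prompt dropping can be sketched in a few lines. The function name `drop_prompt`, the example prompt, and the fixed seed are assumptions for illustration; only the 50% rate and the replacement with an empty string come from the slide.

```python
import random


def drop_prompt(prompt, rng, drop_rate=0.5):
    """Return an empty string with probability drop_rate, else the prompt unchanged."""
    return "" if rng.random() < drop_rate else prompt


rng = random.Random(0)                      # seeded only to make the sketch repeatable
prompts = ["a photo of a house"] * 10_000   # hypothetical training captions
training_prompts = [drop_prompt(p, rng) for p in prompts]

empty_fraction = training_prompts.count("") / len(training_prompts)
print(f"fraction of empty prompts: {empty_fraction:.3f}")
```

With an empty prompt, the text gives the model nothing to go on, so the conditioning image is the only remaining source of information about the content.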

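Differences in generated results by prompting and control method
• Control methods
  • ControlNet as described above: (a)
  • ControlNet w/o zero convolution: (b)
  • ControlNet-lite, in which the trainable copy is replaced by a single convolution layer: (c)
• Prompts
  • No Prompt
  • Insufficient Prompt: does not mention the objects in the image
  • Conflicting Prompt: changes the meaning of the control image
  • Perfect Prompt: fully specifies the objects in the image and the meaning of the control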

Differences in generated results by prompting and control method (continued; figure)

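Human evaluation
• User study
  • 100 images in total were generated from 20 controls and prompts with 5 models
  • 12 users ranked the 100 images on a 5-point scale (1 is worst, 5 is best) for "quality" and for "fidelity (to the instructions)". The mean rank per model is reported as AHR (Average Human Ranking)
• Distinguishing ControlNet images from those of a large-scale model
  • Tested how accurately 12 users could tell apart images generated by SDv2-D2I (Stable Diffusion V2 Depth-to-Image) and by ControlNet. The mean accuracy was 52%, close to the 50% expected from guessing, so the two models are almost indistinguishable
※ SDv2-D2I was trained on 12M images for thousands of hours on an A100 cluster. ControlNet was trained on 200k images for 5 days on an RTX 3090Ti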
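
A small sketch of how the two human-evaluation numbers could be computed. All scores and guesses below are made-up placeholders, not data from the paper, and the function names are illustrative. AHR is taken here as the mean of the 1-to-5 ranks a model received, as described on the slide.

```python
from statistics import mean


def average_human_ranking(ranks_by_model):
    """Mean rank per model; ranks run from 1 (worst) to 5 (best)."""
    return {model: mean(ranks) for model, ranks in ranks_by_model.items()}


def accuracy(guesses, truths):
    """Fraction of images whose source model was guessed correctly."""
    correct = sum(1 for g, t in zip(guesses, truths) if g == t)
    return correct / len(truths)


# Hypothetical ranks given by users to two models.
ranks_by_model = {
    "model_a": [5, 4, 5, 4],
    "model_b": [2, 3, 1, 2],
}
ahr = average_human_ranking(ranks_by_model)

# Hypothetical "which model made this image?" test with 25 images.
truths = ["controlnet", "sdv2_d2i"] * 12 + ["controlnet"]
guesses = truths[:13] + ["wrong"] * 12   # 13 correct, 12 incorrect
acc = accuracy(guesses, truths)

print(ahr)
print(f"accuracy = {acc:.2f}")
```

For a two-way choice, an accuracy near 0.5 is what random guessing would give, which is why the slide reads 52% as "almost no difference".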

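Quantitative evaluation of fidelity using IoU
• Evaluation on ADE20K (a semantic segmentation dataset)
• Images are generated under segmentation control and then segmented with a SoTA segmentation model (OneFormer); the resulting segmentation is compared with the ground truth by IoU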
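
For reference, IoU between two label maps can be computed as below. This is a generic sketch of the metric with made-up 3×3 label maps. It does not reproduce the paper's evaluation pipeline, and the function names are illustrative.

```python
def class_iou(pred, truth, label):
    """Intersection over union of the pixels assigned to `label` in two label maps."""
    intersection = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            in_pred, in_truth = (p == label), (t == label)
            intersection += int(in_pred and in_truth)
            union += int(in_pred or in_truth)
    return intersection / union if union else None


def mean_iou(pred, truth):
    """Average the per-class IoU over every label present in either map."""
    labels = sorted({value for row in pred + truth for value in row})
    return sum(class_iou(pred, truth, label) for label in labels) / len(labels)


# Hypothetical ground-truth segmentation and the segmentation of a generated image.
truth = [[0, 0, 1],
         [0, 1, 1],
         [2, 2, 1]]
pred = [[0, 0, 1],
        [0, 0, 1],
        [2, 1, 1]]

per_class = {label: class_iou(pred, truth, label) for label in (0, 1, 2)}
miou = mean_iou(pred, truth)
print(per_class, miou)
```

A higher IoU means the generated image, once segmented, matches the layout of the control segmentation more closely.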

Comparison with existing models on other metrics (figure)

Appendix
• Relationship between dataset size and generated images
• Inferring meaning from ambiguous controls
• Generated images when the same control is applied to other models

References 1
[1] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. arXiv preprint arXiv:2302.05543, 2023. http://arxiv.org/abs/2302.05543
[2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
[3] Chong Mou, Xintao Wang, Liangbin Xie, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. arXiv preprint arXiv:2302.08453, 2023.
[4] Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas J. Guibas, and Jitendra Malik. Side-tuning: Network adaptation via additive side networks. In European Conference on Computer Vision (ECCV), pages 698–714. Springer, 2020.
[5] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.
[6] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. 2022.
[7] Jeffrey O. Zhang, Alexander Sax, Amir Zamir, Leonidas J. Guibas, and Jitendra Malik. Side-tuning: Network adaptation via additive side networks. In European Conference on Computer Vision (ECCV), pages 698–714. Springer, 2020.

References 2
[8] Stability. Stable diffusion v1.5 model card, https://huggingface.co/runwayml/stable-diffusion-v1-5, 2022.
[9] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision (ECCV), pages 89–106. Springer, 2022.
[10] Omri Avrahami, Thomas Hayes, Oran Gafni, Sonal Gupta, Yaniv Taigman, Devi Parikh, Dani Lischinski, Ohad Fried, and Xi Yin. Spatext: Spatio-textual representation for controllable image generation. arXiv preprint arXiv:2211.14305, 2022.
[11] Yuheng Li, Haotian Liu, Qingyang Wu, Fangzhou Mu, Jianwei Yang, Jianfeng Gao, Chunyuan Li, and Yong Jae Lee. Gligen: Open-set grounded text-to-image generation. 2023.
[12] Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022.
[13] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. arXiv preprint arXiv:2208.12242, 2022.
[14] Chitwan Saharia, William Chan, Huiwen Chang, Chris Lee, Jonathan Ho, Tim Salimans, David Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. In ACM SIGGRAPH 2022 Conference Proceedings, SIGGRAPH '22, New York, NY, USA, 2022. Association for Computing Machinery.
[15] Tengfei Wang, Ting Zhang, Bo Zhang, Hao Ouyang, Dong Chen, Qifeng Chen, and Fang Wen. Pretraining is all you need for image-to-image translation. 2022.