[DL Reading Group] Learning What and Where to Draw (NIPS’16)

February 15, 17

#deep learning #Deep Learning #GAN #GAWWN #Image Generation #Computer Vision

Slide overview

2017/2/15
Deep learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

Text of each page

Paper reading: Learning What and Where to Draw (NIPS’16) 2017/1/20

Bibliographic information
• Learning What and Where to Draw
• Scott Reed (Google), Zeynep Akata (MPI), Santosh Mohan (umich), Samuel Tenka (umich), Bernt Schiele (MPI), Honglak Lee (umich)
• NIPS’16 (Conference Event Type: Poster)
• https://papers.nips.cc/paper/6111-learning-what-and-where-to-draw

c.f. Generative Adversarial Text to Image Synthesis
• ICML’16
• http://www.slideshare.net/mmisono/generative-adversarial-text-to-image-synthesis

Generative Adversarial What-Where Network (GAWWN)
• A GAN that lets the user specify "what" to draw and "where" to draw it
• "What" is given as text; "where" is given as a bounding box or keypoints

Bounding-box-conditional text-to-image model
1. Convert the text embedding into an M x M x T feature map
2. Normalize it to fit the bounding box, and fill the surrounding area with zeros (mask with 0)
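The two steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function name `text_to_bbox_map`, the normalized `(x, y, w, h)` box convention, and the use of plain index slicing are all assumptions made here. The slides note that the actual implementation relies on a spatial transformer module, which this sketch does not reproduce.

```python
import numpy as np

def text_to_bbox_map(text_emb, bbox, M=16):
    """Place a T-dim text embedding inside a bounding box on an M x M grid.

    bbox = (x, y, w, h) in normalized [0, 1] coordinates (assumed convention).
    Grid cells outside the box are left at zero.
    """
    T = text_emb.shape[0]
    fmap = np.zeros((M, M, T), dtype=text_emb.dtype)
    x, y, w, h = bbox
    # Convert normalized coordinates to grid indices, clamped to the grid.
    c0 = max(int(np.floor(x * M)), 0)
    r0 = max(int(np.floor(y * M)), 0)
    c1 = min(int(np.ceil((x + w) * M)), M)
    r1 = min(int(np.ceil((y + h) * M)), M)
    # Replicate the embedding at every cell inside the box (broadcast over cells).
    fmap[r0:r1, c0:c1, :] = text_emb
    return fmap
```

For example, with `M=8` and a box of `(0.25, 0.25, 0.5, 0.5)`, only the central 4 x 4 block of cells carries the embedding; everything else is zero.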

Keypoint-conditional text-to-image model
Keypoints are specified in grid coordinates, and each one corresponds to a part such as head or left foot.
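One way to picture this encoding is a grid with one channel per part, where a cell is set to 1 at that part's location. The sketch below is only an illustration under that assumption; `PART_NAMES`, `keypoints_to_grid`, and the `(row, col)` convention are hypothetical names chosen here, not taken from the paper or its code.

```python
import numpy as np

# Hypothetical subset of part names, for illustration only.
PART_NAMES = ["head", "left_foot", "right_foot"]

def keypoints_to_grid(keypoints, M=16, parts=PART_NAMES):
    """Encode keypoints as an M x M x K binary grid, one channel per part.

    keypoints: dict mapping part name -> (row, col) grid coordinates.
    Parts that are not given leave their channel all zeros.
    """
    grid = np.zeros((M, M, len(parts)), dtype=np.float32)
    for ch, name in enumerate(parts):
        if name in keypoints:
            r, c = keypoints[name]
            grid[r, c, ch] = 1.0
    return grid
```

Calling `keypoints_to_grid({"head": (2, 3)}, M=4)` marks a single cell in the "head" channel and leaves the other channels empty.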

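Conditional keypoint generation model
• Entering every keypoint by hand is tedious
• In these experiments, each bird has 15 keypoints
• Here, a conditional GAN generates the keypoints
• Keypoint: x, y are the coordinates and v is the visible flag
• If v = 0, then x = y = 0
• Generator: s is 1 at the positions corresponding to the keypoints specified by the user
• D is trained to output 1 for real keypoints and 0 for synthesized ones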
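The role of the switch s can be sketched as a gate: keypoints the user specified pass through unchanged, and the rest are filled in from a generated proposal. This is a minimal sketch under that assumption. The function name `complete_keypoints` and the `(K, 3)` array layout are hypothetical, and `f_out` merely stands in for whatever the generator network outputs from the noise and text.

```python
import numpy as np

def complete_keypoints(k, s, f_out):
    """Combine user-specified and generated keypoints.

    k:     (K, 3) array of rows (x, y, v) given by the user
    s:     (K,) array, 1 where the user specified that keypoint, else 0
    f_out: (K, 3) array proposed by a generator network (stand-in here)
    """
    s = np.asarray(s, dtype=k.dtype).reshape(-1, 1)
    # Keep k where s == 1, take the generated value where s == 0.
    return s * k + (1.0 - s) * f_out
```

With `s = [1, 0]`, the first keypoint in the result equals the user's input and the second comes entirely from `f_out`.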

10.

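Experiments : Dataset
• CUB Birds dataset
• 200 bird species, 11,788 images
• Each image has 10 captions, 1 bounding box, and 15 keypoints
• MHP (MPII Human Pose)
• 25k images, 410 kinds of activities
• 3 captions per image
• 19k images remain after excluding those that show multiple people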

11.

Experiments : Misc
• Text encoder: char-CNN-GRU
• Probably the same as in Generative Adversarial Text to Image Synthesis
• Solver: Adam
• Batch size: 16
• Learning rate: 0.0002
• Implementation: torch
• Spatial transform: https://github.com/qassemoquab/stnbhwd
• Loosely based on dcgan.torch
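For reference, the training settings listed on this slide could be gathered into a small config object like the one below. The class name `TrainConfig` and its field names are hypothetical; only the values come from the slide, and any setting the slide does not mention is deliberately left out.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    # Values as listed on the slide; field names are illustrative.
    text_encoder: str = "char-CNN-GRU"
    solver: str = "Adam"
    batch_size: int = 16
    learning_rate: float = 0.0002

cfg = TrainConfig()
```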

12.

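Conditional bird location via bounding boxes
・The text and noise are the same for all three images
・The backgrounds of the three images are similar but not identical
・The bird's orientation stays the same even when the bounding box changes
・z presumably carries the information that cannot be controlled directly, such as background and orientation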

13.

Conditional individual part locations via keypoints
・Keypoints are fixed to the ground truth (not synthesized)
・The noise differs for each example
・The keypoints are invariant to the noise
・The background and similar details change with the noise

14.

Using keypoints condition
・The beak and tail are specified
・All the birds face left (c.f. condition on bounding box)

15.

Generating both bird keypoints and images from text alone
・Keypoints are generated from the text alone, and the image is then generated
・Quality drops when all keypoints are generated

16.

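Comparison with prior work
・Prior work captures the text almost accurately, but parts such as the beak are sometimes missing (64x64)
・The proposed method generates nearly accurate images at 128x128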

17.

Generating Human
・Quality is lower than for birds
・Few captions are similar to one another, and complex poses are difficult (poses at about the level of yoga come out reasonably well)

18.

Summary
• GAWWN: conditions where to draw on bounding boxes and keypoints
• On the CUB dataset, it can generate high-quality images at 128x128
• Future work
• Learn object locations in an unsupervised or weakly supervised way
• Better text-to-human generation

19.

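Impressions
• The novelty lies in how the "where" information is encoded
• bounding box
• keypoints
• Text alone leaves too much freedom; supplying location information makes the image easier to generate
• The detailed network architecture comes with no explanation of why it was designed that way, so the reasoning is unclear
• Some more theoretical grounding would be welcome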