[DL Reading Group] The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision

1.

DEEP LEARNING JP [DL Papers]
The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
Kazuki Fujikawa, DeNA
http://deeplearning.jp/

2.

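Summary
• Bibliographic information
– The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
  • ICLR 2019
  • Jiayuan Mao, Chuang Gan, Pushmeet Kohli, Joshua B. Tenenbaum, Jiajun Wu
• Overview
– Proposes a framework that, within end-to-end training for visual question answering (VQA), learns the recognition of object concepts separately from the reasoning logic
  • The only supervision required is question-answer pairs
– Experiments demonstrate the following properties of the proposed method
  • It is data-efficient, reaching high accuracy with a small amount of training data
  • Rather than only outputting an answer, it can make explicit the process that led to the answer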

3.

Outline
• Background
• Related work
• Proposed method
• Experiments and results

5.

Background
• Recognizing the concepts attached to objects (attributes such as color and shape) is important
– When humans answer complex VQA questions, they treat concept information and logic (e.g. counting) separately
– The same may hold for machine learning models: if concept information and logic can be learned and output separately, data efficiency and interpretability may improve

Figure 1 (from Mao et al., published as a conference paper at ICLR 2019), example questions:
I. Learning basic, object-based concepts. Q: What's the color of the object? A: Red. Q: Is there any cube? A: Yes. Q: What's the color of the object? A: Green. Q: Is there any cube? A: Yes.
II. Learning relational concepts based on referential expressions. Q: How many objects are right of the red object? A: 2. Q: How many objects have the same material as the cube? A: 2
III. Interpret complex questions from visual cues. Q: How many objects are both right of the green cylinder and have the same material as the small blue ball? A: 3

Figure 1: Humans learn visual concepts, words, and semantic parsing jointly and incrementally. I. Learning visual concepts (red vs. green) starts from looking at simple scenes, reading simple questions, and reasoning over contrastive examples (Fazly et al., 2010). II. Afterwards, we can interpret referential expressions based on the learned object-based concepts, and learn relational concepts (e.g., on the right of, the same material as). III. Finally, we can interpret complex questions from visual cues by exploiting the compositional structure.

7.

Related Work
• Positioning of related work and this work
– End-to-End (Hudson+ 2018, Mascharka+ 2018, etc.)
  • Input 1: image, Input 2: question → NN → Output: answer
  • Module separation: × / Interpretability: △
  • Supervision: image, question → answer
– Program-based approach (Yi+ 2018)
  • Input 1: image → NN → Intermediate output 1: concepts (e.g. a table of ID, Color, Shape: 1 Green Cube, 2 Red Sphere)
  • Input 2: question → NN → Intermediate output 2: program (e.g. Filter(Red) → Query(Shape))
  • Output: answer (e.g. A: Box)
  • Module separation: ○ / Interpretability: ○
  • Supervision: image → concepts, question → program; concepts, program → answer
– Program-based approach (this work)
  • Input 1: image → NN → Intermediate output 1: vectors (one per object)
  • Input 2: question → NN → Intermediate output 2: program (e.g. Filter(Red) → Query(Shape))
  • Output: answer (e.g. A: Box)
  • Module separation: ○ / Interpretability: ○
  • Supervision: image, question → answer

[Figure 4 from Mao et al. (ICLR 2019), shown as background on the slide: A. Curriculum concept learning; B. Illustrative execution of NS-CL]

9.

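Proposed Method: Joint Learning of Concepts and Semantic Parsing
• 1. Supervised learning of the visual and concept representation spaces (the program-generation part is fixed)
– Input 1: image; Input 2: question (Q: What is the shape of the red object?)
① Detect object regions in the image with Mask R-CNN and extract visual features with ResNet-34
② Generate a program from the question with an encoder-decoder (BiGRU-GRU) method [Dong+, 2016], here Filter(Red) → Query(Shape)
③ Obtain the concept embedding needed for the first line of the program (the color embedding)
④ Execute Filter (restrict to the object with the highest cosine similarity to Red)
⑤ Obtain the concept embedding needed for the second line of the program (the shape embedding)
⑥ Execute Query (obtain the shape with the highest cosine similarity to Obj: 2) and output it as the prediction
⑦ Backpropagate the error against the ground truth and update the embedding spaces
– Example in the diagram: the shape embedding space contains Cylinder, Box, and Sphere; the prediction is A: Box while the ground truth is Sphere, so the error is backpropagated (BP)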
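Steps ④ and ⑥ can be illustrated with a minimal sketch. It assumes the object features have already been projected into the color and shape embedding spaces; all vectors, names, and the 2-D dimensionality are made up for illustration, and the hard argmax is a simplification. The training update of step ⑦ is omitted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical per-object features in the color and shape spaces.
obj_color = [[0.1, 0.9],   # Obj 1 (index 0): close to "green"
             [0.9, 0.2]]   # Obj 2 (index 1): close to "red"
obj_shape = [[0.0, 1.0],
             [0.8, 0.6]]

# Hypothetical concept embeddings (these are what step 7 would update).
color_concepts = {"red": [1.0, 0.0], "green": [0.0, 1.0]}
shape_concepts = {"cylinder": [0.0, 1.0],
                  "box": [1.0, 0.5],
                  "sphere": [1.0, -1.0]}

def filter_op(obj_feats, concept_vec):
    """Filter: index of the object most similar to the concept embedding."""
    sims = [cosine(f, concept_vec) for f in obj_feats]
    return sims.index(max(sims))

def query_op(obj_feat, concepts):
    """Query: name of the concept most similar to the object's feature."""
    return max(concepts, key=lambda name: cosine(obj_feat, concepts[name]))

selected = filter_op(obj_color, color_concepts["red"])   # Filter(Red)
answer = query_op(obj_shape[selected], shape_concepts)   # Query(Shape)
```

With these made-up vectors, Filter(Red) selects the second object and Query(Shape) returns "box", mirroring the wrong prediction in the diagram that the backpropagation step is meant to correct.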

18.

Proposed Method: Joint Learning of Concepts and Semantic Parsing
• 2. Reinforcement learning of program generation (the concept embeddings are fixed)
– Input 1: image; Input 2: question (Q: What is the shape of the red object?)
① Detect object regions in the image with Mask R-CNN and extract visual features with ResNet-34
② Generate a program from the question with an encoder-decoder (BiGRU-GRU) method [Dong+, 2016], here Filter(Red) → Query(Shape)
③ Obtain the concept embedding needed for the first line of the program (the color embedding)
④ Execute Filter (restrict to the object with the highest cosine similarity to Red)
⑤ Obtain the concept embedding needed for the second line of the program (the shape embedding)
⑥ Execute Query (obtain the shape with the highest cosine similarity to Obj: 2) and output it as the prediction
⑦ Use correct / incorrect as the reward and update the program-generation policy with REINFORCE
– Example in the diagram: the prediction is A: Box while the ground truth is Sphere; this outcome is fed back through REINFORCE
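The reward-driven update in step ⑦ can be sketched as follows. This is a simplified illustration, not the BiGRU-GRU decoder from the slide: the policy is replaced by a softmax over a fixed list of hypothetical candidate programs, and the executor is a stub dictionary with made-up answers. Only the REINFORCE rule itself (reward times the gradient of the log-probability of the sampled program) is the point.

```python
import math
import random

# Hypothetical candidate programs for "What is the shape of the red object?"
CANDIDATES = [
    "Filter(Red) -> Query(Shape)",
    "Filter(Red) -> Query(Color)",
    "Filter(Green) -> Query(Shape)",
]
# Stub executor: answers the fixed concept embeddings would give (made up).
EXECUTED_ANSWER = {
    CANDIDATES[0]: "sphere",
    CANDIDATES[1]: "red",
    CANDIDATES[2]: "cylinder",
}
GROUND_TRUTH = "sphere"

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=200, lr=0.5, seed=0):
    """Run REINFORCE on the program logits and return the final policy."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample one program from the current policy.
        i = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        # Reward is 1 for a correct answer, 0 otherwise.
        correct = EXECUTED_ANSWER[CANDIDATES[i]] == GROUND_TRUTH
        reward = 1.0 if correct else 0.0
        # d log pi(i) / d logit_j = 1[j == i] - probs[j]
        for j in range(len(logits)):
            indicator = 1.0 if j == i else 0.0
            logits[j] += lr * reward * (indicator - probs[j])
    return softmax(logits)

final_probs = train()
```

Because the reward is zero for wrong answers and no baseline is used, only sampled programs that execute to the correct answer change the policy, so probability mass shifts toward the first candidate.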

20.

Proposed Method: Joint Learning of Concepts and Semantic Parsing
• Training proceeds by alternating between 1. and 2.
– Within a curriculum learning framework, the difficulty of the problems is raised gradually

Figure 4 (from Mao et al., ICLR 2019):
A. Curriculum concept learning
– Initialized with DSL and executor.
– Lesson 1: Object-based questions. Q: What is the shape of the red object? A: Cube.
– Lesson 2: Relational questions. Q: How many cubes are behind the sphere? A: 3
– Lesson 3: More complex questions. Q: Does the red object left of the green cube have the same shape as the purple matte thing? A: No
– Deploy: complex scenes, complex questions. Q: Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube? A: No.
B. Illustrative execution of NS-CL
– Q: Does the red object left of the green cube have the same shape as the purple matte thing?
– Step 1: Visual parsing (Obj 1, Obj 2, Obj 3, Obj 4)
– Steps 2, 3: Semantic parsing and program execution: Filter(Green Cube) → Relate(Object 2, Left) → Filter(Red) → Filter(Purple Matte) → AEQuery(Object 1, Object 3, Shape) → No (0.98)

Figure 4: A. Demonstration of the curriculum learning of visual concepts, words, and semantic parsing of sentences by watching images and reading paired questions and answers. Scenes and questions of different complexities are illustrated to the learner in an incremental manner. B. Illustration of our neuro-symbolic inference model for VQA. The perception module begins with parsing visual scenes into object-based deep representations, while the semantic parser parse sentences into executable programs. A symbolic execution process bridges two modules.
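The alternation between phases 1 and 2 under a curriculum can be sketched as a schedule generator. Everything specific here is an assumption made for illustration: the `num_ops` complexity score, the thresholds, the cumulative lesson pools, and the phase names. The slide only states that the two phases alternate and that difficulty rises gradually.

```python
def curriculum(examples, max_ops_per_lesson, rounds_per_lesson=2):
    """Yield (lesson, phase, example) tuples in curriculum order.

    examples: list of dicts with a hypothetical "num_ops" complexity score.
    max_ops_per_lesson: increasing complexity thresholds, one per lesson.
    """
    for lesson, max_ops in enumerate(max_ops_per_lesson, start=1):
        # Assumption: each lesson also revisits the easier examples.
        pool = [ex for ex in examples if ex["num_ops"] <= max_ops]
        for _ in range(rounds_per_lesson):
            # Phase 1: update concept embeddings, program generator fixed.
            for ex in pool:
                yield lesson, "concepts", ex
            # Phase 2: update program generator, concept embeddings fixed.
            for ex in pool:
                yield lesson, "parser", ex

# Questions taken from the slide; the num_ops values are made up.
examples = [
    {"q": "What is the shape of the red object?", "num_ops": 2},
    {"q": "How many cubes are behind the sphere?", "num_ops": 4},
    {"q": "Does the red object left of the green cube have the same "
          "shape as the purple matte thing?", "num_ops": 6},
]

schedule = list(curriculum(examples, [2, 4, 6], rounds_per_lesson=1))
```

A real training loop would replace each yielded tuple with the corresponding update step; the generator only fixes the order in which examples and phases are visited.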

22.

Experiments: Quantitative Evaluation
• Experiment: CLEVR dataset [Johnson+, 2017]
– A question-answering dataset about scenes containing multiple objects such as spheres and cubes
– Train: 70K, Valid: 15K, Test: 15K
– Experiments were run both with the full training set (70K) and with a subset (5K)
– Performance remains strong even with 10% of the data, showing that the method is data-efficient

[Figure: bar charts of overall QA accuracy on CLEVR, with the full training data and with 10% of the data, for IEP, MAC, TbD, and NS-CL]

Figure source: http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL-poster.pdf http://nscl.csail.mit.edu/data/resources/2019ICLR-NSCL.pptx

23.

Experiments: Qualitative Evaluation
• Experiment: CLEVR dataset [Johnson+, 2017]
– One advantage of the proposed method is that it can make explicit the decision process that led to an answer
  • When the model answers incorrectly, we can find out why it went wrong

Figure 11 (from Mao et al., published as a conference paper at ICLR 2019):
– Example A. Q: Do the cyan cylinder that is behind the gray cylinder and the gray cylinder have the same material? Program: Filter(Gray Cylinder) → Relate(Behind) → Filter(Cyan Cylinder); Filter(Gray Cylinder); AEQuery(Material). Result: Yes (0.92)
– Example B. Q: There is a small blue object that is to the right of the small red matte object; what shape is it? Program: Filter(Small Red Matte Object) → Relate(Right) → Filter(Small Blue Object) → Query(Shape). Result: Cube (0.85) ✓
– Example C. Failure Case. Q: What is the color of the big box left of the blue metal cylinder? Program: Filter(Blue Metal Cylinder) → Relate(Left) → Filter(Big Box) → Query(Color). Result: Execution Abort, "No such object found!" Read-out of the object's visual representation: Color: Blue ✓, Material: Rubber ✕, Shape: Cylinder ✓, Size: Small
– Example D. Ambiguous Program Case. Q: What is the color of the big metal object? Program: Filter(Big Metal Object) → Query(Color). Result: Execution Abort, "Ambiguous Referral!"

Figure 11: Visualization of the execution trace generated by our Neuro-Symbolic Concept Learner on the CLEVR dataset. Example A and B are successful executions that generate correct answers. In example C, the execution aborts at the first operator. To inspect the reason why the execution engine fails to find the corresponding object, we can read out the visual representation of the object, ...
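The two abort cases in Examples C and D can be reproduced with a small symbolic executor that records a trace. This is a sketch under strong assumptions: the scene is a hand-written list of attribute dictionaries rather than learned representations, matching is exact rather than similarity-based, and the Relate operator is left out. The function and class names are hypothetical.

```python
class ExecutionAbort(Exception):
    """Raised when a program cannot be executed; keeps the partial trace."""
    def __init__(self, reason, trace):
        super().__init__(reason)
        self.reason = reason
        self.trace = trace

def execute(program, scene):
    """Run (op, arg) steps on a symbolic scene; return (answer, trace)."""
    selected = list(range(len(scene)))
    trace = []
    for op, arg in program:
        if op == "filter":
            # Keep only objects whose attributes match every requested value.
            selected = [i for i in selected
                        if all(scene[i][k] == v for k, v in arg.items())]
            trace.append((op, arg, selected))
            if not selected:
                raise ExecutionAbort("No such object found!", trace)
        elif op == "query":
            if len(selected) > 1:
                raise ExecutionAbort("Ambiguous referral!", trace)
            answer = scene[selected[0]][arg]
            trace.append((op, arg, answer))
            return answer, trace
        else:
            raise ValueError(f"unknown op: {op}")
    raise ValueError("program ended without a query")

def try_execute(program, scene):
    """Return the answer, or a readable abort message."""
    try:
        return execute(program, scene)[0]
    except ExecutionAbort as abort:
        return f"abort: {abort.reason}"

# Hypothetical perceived scene: the cylinder is (mis)perceived as rubber.
scene = [
    {"color": "blue", "material": "rubber", "shape": "cylinder", "size": "small"},
    {"color": "gray", "material": "metal", "shape": "cube", "size": "large"},
    {"color": "red", "material": "metal", "shape": "sphere", "size": "large"},
]

ok = [("filter", {"color": "red"}), ("query", "shape")]
case_c = [("filter", {"color": "blue", "material": "metal", "shape": "cylinder"}),
          ("query", "color")]
case_d = [("filter", {"size": "large", "material": "metal"}), ("query", "color")]
```

Calling `try_execute` on `case_c` aborts at the first filter because no perceived object is both blue and metal, and `case_d` aborts at the query because two large metal objects remain. In both cases the trace attached to the exception shows which step failed.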

24.

Experiments: Qualitative Evaluation
• Experiment: VQS dataset [Gan+, 2017]
– The method is also applicable to real-image data
  • CLEVR is generated programmatically, so program annotations can be produced as well, but annotating real-image data with programs is costly
  • The proposed method does not need program annotations, so it can be applied to real-image data

Figure (from Mao et al., published as a conference paper at ICLR 2019):
– Example A. Q: How many zebras are there? Program: Filter(Zebra) → Count. Result: 3 ✓
– Example B. Q: What is the sharp object on the table? Program: Filter(Table) → Relate(On) → Filter(Sharp Object) → Query(What). Result: Knife (0.85) ✓
– Example C. Q: What kind of desert is plated? Program: Filter(Desert, Plated) → Query(Kind). Result: Cake (0.68) ✓
– Example D. Q: What are the kids doing? Program: Filter(Kids) → Query(What). Result: Playing_Frisbee (0.70) ✕ (Groundtruth: Playing_Baseball)

25.

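Conclusion
• Proposed a framework that, within end-to-end training for visual question answering (VQA), learns the recognition of object concepts separately from the reasoning logic
– The only supervision required is question-answer pairs
• Experiments demonstrated the following properties of the proposed method
– It is data-efficient, reaching high accuracy with a small amount of training data
– Rather than only outputting an answer, it can make explicit the process that led to the answer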

26.

References
• Mao, Jiayuan, et al. "The Neuro-Symbolic Concept Learner: Interpreting scenes, words, and sentences from natural supervision." In Proc. of ICLR, 2019.
• Hudson, Drew A., et al. "Compositional attention networks for machine reasoning." In Proc. of ICLR, 2018.
• Mascharka, David, et al. "Transparency by design: Closing the gap between performance and interpretability in visual reasoning." In Proc. of CVPR, 2018.
• Yi, Kexin, et al. "Neural-Symbolic VQA: Disentangling reasoning from vision and language understanding." In Proc. of NeurIPS, 2018.
• Johnson, Justin, et al. "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning." In Proc. of CVPR, 2017.
• Gan, Chuang, et al. "VQS: Linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation." In Proc. of ICCV, 2017.
