[DL Reading Group] Bridge-Prompt: Toward Ordinal Action Understanding in Instructional Videos

March 20, 23

#deep learning #Deep Learning #Prompt Engineering #Action Understanding #Instructional Videos #NLP

Slide overview

2023/3/3
Deep Learning JP
http://deeplearning.jp/seminar-2/

Text of each page

DEEP LEARNING JP
Bridge-Prompt: Toward Ordinal Action Understanding in Instructional Videos (CVPR 2022)
[DL Papers] Yoshifumi Seki
http://deeplearning.jp/

Bibliographic information
● Venue
  ○ CVPR 2022
● Authors
  ○ Tsinghua University https://github.com/ttlmh/Bridge-Prompt
● Reason for selection
  ○ I have recently been working on action analysis from video

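Background and purpose
● We want to do action analysis from video well
● Actions have continuity (they occur in an order)
  ○ ex. drinking water
    ■ hold a cup -> pour water -> drink the water
  ○ ex. eating bread
    ■ spread butter -> spread jam -> eat the bread
● We want to build this continuity into the model
  ○ Several graph models have appeared recently, but they cannot handle unknown labels
● Use Prompt Engineering to exploit the strengths of large language models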

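What is Prompt Engineering?
● The idea of placing a given input (label information, etc.) into a template and feeding it in as a proper sentence, so that the task can benefit from a large language model
● Adopted in the few-shot learning mechanism of GPT-3
● Used for text-image matching in image classification with OpenAI's CLIP
● Also applied to video in ActionCLIP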
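The template idea above can be sketched in a few lines. This is only an illustration: the template wording `a video of a person {label}.` and the function name `build_prompts` are assumptions, not taken from the slides or from any of the cited models.

```python
# Hypothetical template; the wording is an assumption for illustration only.
TEMPLATE = "a video of a person {label}."


def build_prompts(labels, template=TEMPLATE):
    """Turn raw class labels into natural-language sentences."""
    return [template.format(label=label) for label in labels]


if __name__ == "__main__":
    for sentence in build_prompts(["pouring water", "spreading butter"]):
        print(sentence)
```

The resulting sentences, rather than the bare labels, are what would be passed to a text encoder.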

CLIP (ICML 2021): from the 2021/1/15 presentation

ActionCLIP
● https://arxiv.org/abs/2109.08472
● Generates sentences from labels via Prompt Engineering, then estimates the label by measuring the similarity between the outputs of the Text Encoder and the Video Encoder
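A minimal sketch of similarity-based label estimation, assuming the text and video encoders have already produced embedding vectors. The toy vectors below stand in for real encoder outputs, and the use of cosine similarity and the names `cosine_similarity` and `predict_label` are illustrative assumptions rather than the ActionCLIP implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_label(video_embedding, text_embeddings):
    """Return the label whose text embedding is closest to the video embedding."""
    return max(
        text_embeddings,
        key=lambda label: cosine_similarity(video_embedding, text_embeddings[label]),
    )


# Toy vectors standing in for encoder outputs (assumption, not real data).
video = [0.9, 0.1, 0.0]
texts = {
    "drink water": [1.0, 0.0, 0.0],
    "eat bread": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    print(predict_label(video, texts))
```

Because the candidate labels enter only through their text embeddings, new labels can be scored without retraining a classification head.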

Proposed method

Overview of the proposed method

Details of the Prompt part
● 1. Statistical Prompt
  ○ How many actions the video contains
  ○ The video has {num} actions.
● 2. Ordinal Prompt
  ○ Which position the action has in the sequence
  ○ This is the {ord_i} action in the video.
● 3. Semantic Prompt
  ○ “{ord_i}, the person is performing the action step of {vp_i}”
● 3+1. Integrated Prompt
  ○ Everything
  ○ All Semantic Prompts lined up as sentences
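The four prompt types listed above can be sketched as plain string templates. The templates follow the slide text; the ordinal vocabulary in `ORDINALS`, the capitalization, the trailing period on the semantic prompt, and all function names are assumptions made for illustration, not the authors' code.

```python
# Assumed ordinal vocabulary; the slides only show the placeholder {ord_i}.
ORDINALS = ["first", "second", "third", "fourth", "fifth"]


def statistical_prompt(num_actions):
    """How many actions the video contains."""
    return f"The video has {num_actions} actions."


def ordinal_prompt(index):
    """Position of the action in the sequence (0-based index)."""
    return f"This is the {ORDINALS[index]} action in the video."


def semantic_prompt(index, action):
    """Ordinal position combined with the action description.

    A trailing period is added (assumption) so sentences can be concatenated.
    """
    return (
        f"{ORDINALS[index].capitalize()}, "
        f"the person is performing the action step of {action}."
    )


def integrated_prompt(actions):
    """All semantic prompts lined up as consecutive sentences."""
    return " ".join(semantic_prompt(i, a) for i, a in enumerate(actions))


if __name__ == "__main__":
    steps = ["holding a cup", "pouring water", "drinking water"]
    print(statistical_prompt(len(steps)))
    print(ordinal_prompt(1))
    print(integrated_prompt(steps))
```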

Evaluation datasets
● 50Salads: 50 top-view 30-fps instructional videos of salad preparation
  ○ 19 kinds of actions
● Georgia Tech Egocentric Activities (GTEA): 28 egocentric 15-fps instructional videos of daily kitchen activities
  ○ 74 classes of actions
● Breakfast: 1,712 third-person 15-fps videos of breakfast preparation activities
  ○ 48 different types of actions

Implementation
● Videos are split into units of 16 frames
● Pre-training is done on Kinetics-400 using ActionCLIP
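As a rough sketch of the 16-frame splitting mentioned above: the slides state only the clip length, so the non-overlapping windows, the dropping of leftover frames, and the function name `split_into_clips` are all assumptions made for illustration.

```python
def split_into_clips(num_frames, clip_length=16):
    """Return (start, end) frame-index ranges, end exclusive.

    Clips are consecutive and non-overlapping (assumption); trailing frames
    that do not fill a whole clip are dropped (assumption).
    """
    return [
        (start, start + clip_length)
        for start in range(0, num_frames - clip_length + 1, clip_length)
    ]


if __name__ == "__main__":
    # A 40-frame video yields two full clips; the last 8 frames are dropped.
    print(split_into_clips(40))
```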

Comparison on long-term videos

Comparison and examination of the Fusion Module

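Ability to handle unknown IDs
● If only specific actions are learned during fine-tuning, can similar actions be estimated?
  ○ coffee2tea fine-tunes only on making coffee and checks whether making tea can be predicted
  ○ AKL is the overall accuracy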

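Summary and impressions
● I learned for the first time that Prompt Engineering is spreading beyond NLP, which was informative
● Unfortunately, these experiments did not make it clear what the introduced ordering actually contributes
● Handling unknown IDs is impressive, but I doubt whether this experimental setup is an appropriate way to measure it
● I would have liked to read more from the results about how the method differs from existing models
  ○ Accuracy alone does not show what has actually improved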