[DL Reading Group] Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation

September 01, 23

#deep learning #pose estimation #group pose #image processing #multi-person estimation

Text of each slide

DEEP LEARNING JP [DL Papers]
Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation (ICCV2023)
Hiromu Taketsugu, Ukita Lab, TTI
http://deeplearning.jp/

Bibliographic information
• Group Pose: A Simple Baseline for End-to-End Multi-person Pose Estimation
• arXiv: https://arxiv.org/abs/2308.07313
• GitHub: https://github.com/Michel-liu/GroupPose
• Authors: Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang (Beijing Jiaotong University, Baidu, and others)
• Why this paper was chosen:
  – Accepted at ICCV2023
  – The latest of the recently emerging end-to-end multi-person pose estimation methods
  – SoTA on the MS COCO 2017 val and test-dev sets and on CrowdPose
• Unless otherwise noted, images are taken from the paper

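Overview
• End-to-End Multi-person Pose Estimation
  – Simultaneously estimates the poses of an arbitrary number of people in an image or video
  – Difficulties include occlusion between people and scalability with the number of people
Image source: https://github.com/IDEA-Research/ED-Pose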

Overview
• Top-down and bottom-up approaches
Image source: https://openaccess.thecvf.com/content/CVPR2022/html/Shi_End-to-End_Multi-Person_Pose_Estimation_With_Transformers_CVPR_2022_paper.html

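Overview
• Transformer-based methods have now appeared alongside these two families
  – They need none of the post-processing that top-down and bottom-up methods require
  – 👆 Group Pose belongs to this family
Image source: https://openaccess.thecvf.com/content/CVPR2022/html/Shi_End-to-End_Multi-Person_Pose_Estimation_With_Transformers_CVPR_2022_paper.html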

Related work
• End-to-end object detection with DETR (DEtection TRansformer)
  – Directly predicts a set of classes and bounding boxes
  – Makes post-processing such as NMS unnecessary
End-to-End Object Detection with Transformers, Carion et al., ECCV2020 https://arxiv.org/abs/2005.12872

Related work
• End-to-end multi-person pose estimation with PETR
  – A DETR-based architecture that predicts the keypoints of all people at once
[Figure labels: Backbone, Encoder, Decoder]
End-to-End Multi-Person Pose Estimation With Transformers, Shi et al., CVPR2022 https://openaccess.thecvf.com/content/CVPR2022/html/Shi_End-to-End_Multi-Person_Pose_Estimation_With_Transformers_CVPR_2022_paper.html

Related work
• Methods that improve on PETR's decoder have started to appear
  – 👈 CVPR2022
  – 👈 ICLR2023
  – 👈 This paper (ICCV2023)

Related work
• PETR: predicts coarsely with the Pose Decoder, then refines with the Joint Decoder
• ED-Pose: detects a bbox for each person, then detects a bbox for each keypoint
Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation, Yang et al., ICLR2023 https://github.com/IDEA-Research/ED-Pose

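Related work
• Problem with existing methods: the two-stage decoder structure
  – The architecture and processing become more complex
  – Auxiliary supervision is needed to stabilize training
[Figure labels: Heatmap Loss, Detection Loss]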

Method
• Group Pose: uses only one type of decoder
• Simply uses queries for (K keypoints + 1 classification score) × N people
• Proposes the following two components inside the decoder:
  – 1) within-instance self-attention
  – 2) across-instance self-attention
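The query layout on this slide can be made concrete with a small indexing sketch. This is an illustration only, not the authors' code: the helper names `flat_index` and `unflatten`, the flat ordering of queries, and the choice to put the instance (classification) query in slot 0 are all assumptions made for the example.

```python
# Hypothetical layout: each of the N persons owns K keypoint queries plus one
# instance query (for the classification score), i.e. K + 1 slots per person.
K = 17  # MS COCO annotates 17 keypoints per person
N = 3   # number of person slots; a small value chosen for illustration


def flat_index(person, slot, k=K):
    """Map (person, slot) to a position in the flat list of N * (K + 1) queries.

    Slot 0 is assumed to be the instance query; slots 1..K are keypoints.
    """
    return person * (k + 1) + slot


def unflatten(index, k=K):
    """Inverse of flat_index: recover (person, slot) from a flat position."""
    return divmod(index, k + 1)


total_queries = N * (K + 1)
print(total_queries)                  # 54
print(flat_index(2, 0))               # 36
print(unflatten(flat_index(2, 5)))    # (2, 5)
```

With this layout the decoder output can be read back directly: the K keypoint slots of a person give the pose and the remaining slot gives that person's score.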

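Method
• Use of attention masks:
  – To improve the convergence of the Transformer, interactions that are not needed are masked out
  – For example, information about person A's wrist is not needed to locate person B's head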
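How a mask removes an interaction can be shown for a single query. The sketch below is a generic masked softmax in plain Python, not code from the paper; the function name `masked_softmax` and the toy scores are assumptions for illustration.

```python
import math


def masked_softmax(scores, allowed):
    """Softmax over one query's attention scores, ignoring masked-out keys.

    scores  : list of floats, one raw attention score per key
    allowed : list of bools, True where the query may attend to that key
    """
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        raise ValueError("every key is masked out")
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example: a query for person A's wrist scoring four keys.
scores = [2.0, 1.0, 3.0, 0.5]
allowed = [True, True, False, False]  # the last two keys belong to person B
weights = masked_softmax(scores, allowed)
print(weights)
```

The masked keys receive exactly zero weight even when their raw score is the largest, so nothing from person B's queries flows into this query.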

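Method
• 1) within-instance self-attention:
  – Attention among queries that belong to the same instance (the same person)
  – Lets the model take into account relations such as the relative positions of keypoints on a human body
[Figure labels: queries belonging to person A, B, C, …, queries belonging to person N]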
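One way to express "same person only" is a block-diagonal boolean mask over all queries. The sketch below assumes that the K + 1 queries of each person are stored contiguously; that ordering and the function name `within_instance_mask` are assumptions for illustration, not details confirmed by the slides.

```python
def within_instance_mask(n_persons, k):
    """Boolean mask over N * (K + 1) queries.

    Entry [i][j] is True iff queries i and j belong to the same person,
    assuming each person's K + 1 queries are stored contiguously.
    """
    per_person = k + 1
    total = n_persons * per_person
    return [
        [i // per_person == j // per_person for j in range(total)]
        for i in range(total)
    ]


# Two persons with three keypoints each: an 8 x 8 mask with two 4 x 4 blocks.
mask = within_instance_mask(n_persons=2, k=3)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
```

Each query can attend to exactly K + 1 queries (including itself), so a keypoint sees only the other keypoints and the instance query of its own person.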

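Method
• 2) across-instance self-attention:
  – Attention among queries that belong to the same type
  – Can suppress duplicate predictions across instances
[Figure labels: instance query (corresponds to the classification score), head, nose, …, right ankle]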
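Grouping by type can be sketched by listing, for each query type, the flat positions of that type across all persons. The contiguous per-person ordering, the slot numbering, and the function name `same_type_groups` are assumptions made for this example.

```python
def same_type_groups(n_persons, k):
    """Return K + 1 groups, one per query type.

    Group `slot` holds the flat index of that slot for every person, assuming
    person p stores its queries at positions p * (K + 1) .. p * (K + 1) + K.
    """
    per_person = k + 1
    return [
        [p * per_person + slot for p in range(n_persons)]
        for slot in range(per_person)
    ]


# Three persons with two keypoints each: 3 groups of 3 queries.
groups = same_type_groups(n_persons=3, k=2)
for slot, members in enumerate(groups):
    print(slot, members)
```

Self-attention is then applied inside each group separately, so for example all instance queries compare themselves with one another, which is where duplicate predictions of the same person can be suppressed.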

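Method
• Summary of the method
  – Solves multi-person pose estimation end to end with a simple decoder
  – Uses N×(K+1) queries that correspond to the final outputs, and proposes two kinds of group self-attention (within-instance / across-instance) to improve convergence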
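To get a feel for how much the two group self-attentions restrict the interactions, one can count the query-key pairs each variant allows. This is a back-of-the-envelope count, not a measured cost, and N = 100 is an illustrative value chosen for the example (K = 17 is the MS COCO keypoint count); the function name `attention_pairs` is hypothetical.

```python
def attention_pairs(n_persons, k):
    """Count the query-key pairs allowed by each attention pattern."""
    per_person = k + 1
    total = n_persons * per_person
    return {
        "full": total * total,                   # every query sees every query
        "within": n_persons * per_person ** 2,   # N groups of K + 1 queries
        "across": per_person * n_persons ** 2,   # K + 1 groups of N queries
    }


counts = attention_pairs(n_persons=100, k=17)
print(counts)
print((counts["within"] + counts["across"]) / counts["full"])
```

Under these assumed sizes the two grouped patterns together allow 212,400 pairs against 3,240,000 for unrestricted self-attention, which illustrates how many cross-person, cross-type interactions are dropped.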

Experimental results
• Pose estimation performance (excerpt)
  – Dataset: CrowdPose
  – Key point: achieves the best performance on most metrics using only the keypoint regression loss (HM: Heat Map, BR: Box Regression)

Experimental results
• Convergence (horizontal axis: number of epochs)
  – The two group self-attentions make training easier and improve convergence

Experimental results
• Convergence
  – Converges faster than other methods that rely on auxiliary supervision
  – ED-Pose: performs explicit human detection with a two-stage decoder

Experimental results
• Inference speed
  – The simple decoder structure gives a large speedup

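Experimental results
• Qualitative evaluation (CrowdPose)
  – Handles a wide range of cases, including occlusion by other people or objects and differences in clothing and body shape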

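Summary and impressions
• Summary: Group Pose
  – End-to-end multi-person pose estimation with a DETR-like model
  – Achieves SoTA with a simple architecture and without auxiliary supervision
  – Inference is also fast
• Impressions:
  – Recasting the task as a simpler problem formulation and getting better performance as a result feels very much in the spirit of DETR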

Supplementary material
• ED-Pose: detects a bbox for each person, then detects a bbox for each keypoint
Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation, Yang et al., ICLR2023 https://github.com/IDEA-Research/ED-Pose

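Experimental results
• Qualitative evaluation (failure cases on MS COCO)
  – Cases where only some of the keypoints are visible in the image
  – Open question: how is the keypoint score used? (Is it perhaps not output at all?)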