[DL Seminar] EfficientDet: Scalable and Efficient Object Detection

November 22, 2019

#deep learning #object detection #EfficientDet #BiFPN #Scalable

Slide overview

2019/11/22
Deep Learning JP:
http://deeplearning.jp/seminar-2/

Deep Learning JP

@DeepLearning2023

Text of each page

DEEP LEARNING JP [DL Seminar] EfficientDet: Scalable and Efficient Object Detection Hiromi Nakagawa ACES, Inc. https://deeplearning.jp

Overview
• Mingxing Tan, Ruoming Pang, Quoc V. Le (Google Research, Brain Team)
  – The team behind EfficientNet
  – Submitted to arXiv on 2019/11/20
• Brings the EfficientNet approach to object detection
  – Weighted Bi-directional Feature Pyramid Network (BiFPN): extracts multi-scale features efficiently
  – Compound Scaling: scales resolution, depth, and width with a single variable
• Sets a new SoTA on COCO in accuracy, model size, and speed
  – #Params: 4x smaller
  – FLOPs: 9.3x fewer

Introduction

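Introduction
• Recent object detection models tend to become very large
  – AmoebaNet-based NAS-FPN: 167M parameters, 3045B FLOPs (30x more than RetinaNet)
  – This hinders real-world deployment in areas such as robotics and autonomous driving
  – Making models efficient is becoming increasingly important
• There is also a trend toward lighter models, but at the cost of accuracy
  – One-stage, Anchor-free, Compression
• Optimizing for one specific resource budget is not enough either; models are needed that can handle a range of resource constraints
  – 3B FLOPs ~ 300B FLOPs ?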

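Introduction
• Can high accuracy and high efficiency be achieved at the same time? The paper systematically studies detector design choices
• Challenge 1: Efficient Multi-Scale Feature Fusion
  – Proposes the Bidirectional Feature Pyramid Network (BiFPN), which extracts multi-scale features simply and effectively
• Challenge 2: Model Scaling
  – Proposes Compound Scaling, which jointly scales the input image resolution together with network width and depth
• EfficientNet, already a strong model, is also used as the backbone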

Proposed Method

BiFPN: Bi-directional Feature Pyramid Network
• Multi-scale fusion => aggregate features at different resolutions: $P^{in} = (P^{in}_{l_1}, \dots, P^{in}_{l_n})$
  – e.g. Faster R-CNN, YOLO: the top layer has low resolution
  – e.g. SSD: feature extraction in the lower layers is insufficient
  – Feature Pyramid Networks [Lin+ CVPR'17]: lower layers can also use global features (context) while keeping high resolution
• Ref. https://www.slideshare.net/ren4yu/single-shot

BiFPN: Bi-directional Feature Pyramid Network
• (a) Conventional top-down FPN
  – Limited by the one-way information flow
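The one-way flow of a conventional top-down FPN can be sketched as follows. This is a toy illustration only: features are 1-D lists rather than image tensors, the per-node convolutions are omitted, and the helper names `upsample2x` and `top_down_fpn` are hypothetical.

```python
from typing import List

Feature = List[float]  # toy 1-D feature map; real maps are H x W x C tensors


def upsample2x(x: Feature) -> Feature:
    """Nearest-neighbour upsampling: repeat every element twice."""
    return [v for v in x for _ in range(2)]


def top_down_fpn(inputs: List[Feature]) -> List[Feature]:
    """Fuse features ordered from finest (first) to coarsest (last).

    Each output is the input at that level plus the upsampled output of the
    next coarser level. Information only flows from coarse to fine, which is
    the limitation the slide points out. Convolutions are omitted (assumption).
    """
    outputs: List[Feature] = [[] for _ in inputs]
    outputs[-1] = list(inputs[-1])  # coarsest level passes through unchanged
    for level in range(len(inputs) - 2, -1, -1):
        upper = upsample2x(outputs[level + 1])
        outputs[level] = [a + b for a, b in zip(inputs[level], upper)]
    return outputs


p3, p4, p5 = [1.0] * 8, [10.0] * 4, [100.0] * 2
fused = top_down_fpn([p3, p4, p5])
```

In this sketch the finest output accumulates contributions from every coarser level, while the coarsest output never sees the finer ones.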

BiFPN: Bi-directional Feature Pyramid Network
• (b) PANet
  – Adds an extra bottom-up path aggregation network
• (c) NAS-FPN
  – Neural architecture search
  – Requires thousands of GPU hours for search
  – Irregular network, difficult to interpret or modify

BiFPN: Bi-directional Feature Pyramid Network
• (e) Simplified PANet
  – PANet: accurate, but needs more parameters and computation
• (f) BiFPN
  – Extra edges from input to output at the same level
  – Repeat the feature network layer (= bidirectional path)
  – Remove the nodes with only one input edge
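The connectivity described above (a top-down pass, a bottom-up pass with a skip edge from each input, no intermediate node at the two end levels, and repetition of the whole layer) can be sketched roughly as below. It is a toy under stated assumptions: 1-D lists stand in for feature maps, fusion is a plain average instead of learned weights, convolutions are omitted, and all function names are hypothetical.

```python
from typing import List

Feature = List[float]  # toy 1-D feature map standing in for an H x W x C tensor


def upsample2x(x: Feature) -> Feature:
    """Nearest-neighbour upsampling: repeat every element twice."""
    return [v for v in x for _ in range(2)]


def downsample2x(x: Feature) -> Feature:
    """Halve the resolution by taking the max of each adjacent pair."""
    return [max(x[i], x[i + 1]) for i in range(0, len(x), 2)]


def fuse(*features: Feature) -> Feature:
    """Equal-weight average of same-sized features (learned weights omitted)."""
    return [sum(vals) / len(vals) for vals in zip(*features)]


def bifpn_layer(inputs: List[Feature]) -> List[Feature]:
    """One bidirectional layer; inputs run from finest (first) to coarsest (last)."""
    n = len(inputs)
    # Top-down pass: only the middle levels get an intermediate node, because
    # the end levels would have a single input edge and are therefore removed.
    td = list(inputs)
    for i in range(n - 2, 0, -1):
        td[i] = fuse(inputs[i], upsample2x(td[i + 1]))
    # Bottom-up pass: middle levels also receive a skip edge from their input.
    out: List[Feature] = [[] for _ in inputs]
    out[0] = fuse(inputs[0], upsample2x(td[1]))
    for i in range(1, n - 1):
        out[i] = fuse(inputs[i], td[i], downsample2x(out[i - 1]))
    out[n - 1] = fuse(inputs[n - 1], downsample2x(out[n - 2]))
    return out


def bifpn(inputs: List[Feature], num_layers: int) -> List[Feature]:
    """Repeat the bidirectional layer num_layers times."""
    feats = inputs
    for _ in range(num_layers):
        feats = bifpn_layer(feats)
    return feats
```

The sketch expects at least two levels whose lengths halve from one level to the next, for example lengths 16, 8, 4, 2, 1 for five levels.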

BiFPN: Bi-directional Feature Pyramid Network
• Weighted feature fusion: how should multi-scale features be fused?
  – Sum them with equal weight? → No
  – Introduce additional weights and let the network learn the importance of each input feature
  – Unbounded fusion:
    • $w_i$: scalar (per-feature), vector (per-channel), or tensor (per-pixel)
    • A scalar is enough, but it needs bounding for stable training
  – Softmax fusion:
    • Slow on GPU
  – Fast normalized fusion:
    • Efficient
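The formulas for the three fusion variants did not survive in the slide text. The sketch below assumes the forms given in the EfficientDet paper: unbounded $O = \sum_i w_i I_i$, softmax $O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} I_i$, and fast normalized $O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} I_i$ with each $w_i$ passed through a ReLU. It uses one scalar weight per input feature and flat lists as features; the function names and the default `eps` are illustrative choices.

```python
import math
from typing import List, Sequence

Feature = List[float]


def unbounded_fusion(weights: Sequence[float], inputs: Sequence[Feature]) -> Feature:
    """O = sum_i w_i * I_i, with no constraint on the weights."""
    size = len(inputs[0])
    return [sum(w * x[k] for w, x in zip(weights, inputs)) for k in range(size)]


def softmax_fusion(weights: Sequence[float], inputs: Sequence[Feature]) -> Feature:
    """Normalize the weights with a softmax so they lie in (0, 1) and sum to 1."""
    shift = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - shift) for w in weights]
    total = sum(exps)
    return unbounded_fusion([e / total for e in exps], inputs)


def fast_normalized_fusion(
    weights: Sequence[float], inputs: Sequence[Feature], eps: float = 1e-4
) -> Feature:
    """Clamp weights with a ReLU and divide by their sum; no exponentials needed."""
    clamped = [max(0.0, w) for w in weights]
    total = eps + sum(clamped)  # eps avoids division by zero (assumed value)
    return unbounded_fusion([w / total for w in clamped], inputs)


features = [[2.0, 4.0], [4.0, 8.0]]
print(softmax_fusion([1.0, 1.0], features))
print(fast_normalized_fusion([1.0, 1.0], features))
```

With equal weights both normalized variants land at, or very close to, the element-wise mean of the inputs; the fast variant just gets there without evaluating `exp`.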

EfficientDet Architecture
• Backbone: ImageNet-pretrained EfficientNet
• Repeat the BiFPN layer
• Class and box prediction networks share weights across all feature levels

Compound Scaling
• Use a compound coefficient $\phi$ to jointly scale up all dimensions
  – Object detection models have many more scaling dimensions than image classification models
• Backbone network: $B0, \dots, B6$
• BiFPN
  – #channels: $W_{bifpn} = 64 \cdot (1.35^{\phi})$
  – #layers: $D_{bifpn} = 2 + \phi$
• Class/box prediction network
  – #layers: $D_{class} = 3 + \lfloor \phi / 3 \rfloor$
• Input size: $R_{input} = 512 + \phi \cdot 128$
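As a worked example, the scaling rules on this slide can be evaluated for each value of $\phi$. The sketch below makes a few assumptions that are not stated on the slide: $\phi$ ranges over 0 to 6 and selects backbone B$\phi$, and $\phi/3$ is taken as floor division so that the layer count stays an integer. The `ScalingConfig` class and `compound_scaling` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingConfig:
    phi: int
    backbone: str       # EfficientNet variant (assumed mapping: phi -> B{phi})
    input_size: int     # R_input = 512 + phi * 128
    bifpn_width: float  # W_bifpn = 64 * 1.35 ** phi (left unrounded here)
    bifpn_depth: int    # D_bifpn = 2 + phi
    head_depth: int     # D_class = 3 + floor(phi / 3)


def compound_scaling(phi: int) -> ScalingConfig:
    """Evaluate the slide's scaling formulas for one compound coefficient."""
    if not 0 <= phi <= 6:
        raise ValueError("this sketch assumes phi in 0..6 (backbones B0..B6)")
    return ScalingConfig(
        phi=phi,
        backbone=f"B{phi}",
        input_size=512 + phi * 128,
        bifpn_width=64 * 1.35 ** phi,
        bifpn_depth=2 + phi,
        head_depth=3 + phi // 3,
    )


for phi in range(7):
    cfg = compound_scaling(phi)
    print(cfg.backbone, cfg.input_size, round(cfg.bifpn_width, 1),
          cfg.bifpn_depth, cfg.head_depth)
```

A practical implementation would probably round the BiFPN width to a hardware-friendly channel count; that rounding is not part of the formulas above and is left out here.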

Experiments

Experiments
• Trained with batch size 128 on 32 TPUv3 chips
• Achieves SoTA on COCO 2017 in accuracy, parameter count, and speed

Experiments
• Real-world latency: run 10 times with batch size 1
• GPU (Titan-V): up to 3.2x faster
• CPU (single-thread Xeon): up to 8.1x faster
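A latency measurement of this kind could be set up along the following lines. The slide only states "run 10 times with batch size 1"; the warm-up call, the use of the mean, and the `measure_latency` helper itself are assumptions made for illustration, and `run_once` stands in for a single batch-size-1 forward pass.

```python
import statistics
import time
from typing import Callable


def measure_latency(run_once: Callable[[], object], runs: int = 10, warmup: int = 1) -> float:
    """Return the mean wall-clock time in seconds of `runs` calls to run_once.

    Warm-up calls are executed first and excluded from the timing (assumption:
    the slide does not say whether warm-up was used).
    """
    for _ in range(warmup):
        run_once()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def dummy_inference() -> int:
    """Placeholder workload standing in for one forward pass on a single image."""
    return sum(i * i for i in range(10_000))


latency = measure_latency(dummy_inference)
print(f"mean latency over 10 runs: {latency * 1e3:.3f} ms")
```

Comparing two detectors then comes down to dividing their measured means, which is how a figure such as "3.2x faster" would be read.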

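Experiments
• Ablation Study
  ✓ Switching to an EfficientNet backbone alone already improves on RetinaNet
  ✓ Replacing FPN with BiFPN improves the results further
  ✓ BiFPN is more accurate than other feature networks while using fewer parameters and FLOPs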

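Experiments
• Ablation Study
  ✓ Switching feature fusion from softmax to fast fusion gives roughly a 30% speedup with almost no loss in accuracy
  ✓ Compound Scaling yields models with a better mAP/FLOPs trade-off than scaling each dimension individually
(Figure panels: Softmax Fusion, Fast Fusion)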

Conclusion

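Summary
• Proposes EfficientDet, a fast, accurate, and computationally cheap object detection model
  – Uses EfficientNet as the backbone
  – Proposes the BiFPN module, which extracts multi-scale features efficiently; stacking several of them also extracts higher-order features
  – Efficient parameter search through Compound Scaling, which jointly scales resolution, width, and depth with a shared variable
• Achieves SoTA accuracy and speed on COCO
  – 4x smaller and 9.3x fewer FLOPs
  – Latency: 3.2x faster on GPU, 8.1x faster on CPU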

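Impressions
• Accuracy and speed improve through simple tweaks and extensions; the kind of result that makes you think "well, of course that helps"
  – None of the heavily hacked-together feel of NAS-FPN
• Compared with that well-known graph in the YOLOv3 paper (arXiv, 2018/04), the pace of progress is striking
• Other questions
  – It is efficient and also sets a new SoTA in accuracy. If efficiency were given up to push accuracy higher, which direction would that take?
  – The comparison starts from a minimum resolution of 512. What happens below that?
  – How does it perform on other evaluation metrics (mAPxx) and datasets?
  – How sensitive are the heuristics in Compound Scaling?
  – What would combining it with a keypoint-based approach look like? Something along these lines?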