Foundation Models for 3D Scene Understanding

1.

DEEP LEARNING JP [DL Papers] “Foundation Models for 3D Scene Understanding” 2024.09.19 Taiki Miyanishi, Matsuo-Iwasawa Lab http://deeplearning.jp/

2.

Bibliographic information
● Foundation models for 3D scene understanding:
  o An Embodied Generalist Agent in 3D World (Huang+, ICML 2024)
  o Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding (Yuan+, CVPR 2024)
  o Agent3D-Zero: An Agent for Zero-shot 3D Understanding (Zhang+, ECCV 2024)
  o Coarse Correspondences Elicit 3D Spacetime Understanding in Multimodal Language Model (Liu+, arXiv 2024)
● Reasons for selecting these papers:
  o 3D scene understanding is an important research topic required in many fields, including autonomous driving, robotics, and AR/VR
  o Foundation models for 3D scene understanding (spatial perception and reasoning), and 3D scene understanding with large vision-language models, are attracting attention
※ Figures, tables, and videos below are quoted from each paper and its project site

3.

What is 3D scene understanding?
● The ability of a machine to perceive information in 3D space and to understand what exists in a scene, how those things are arranged, and how they interact
[Figure: example task families, with images from the cited papers — 3D semantic segmentation (Zhao+, CVPR 2017), 3D object detection with VoteNet (Qi+, ICCV 2019; "Deep Hough Voting for 3D Object Detection in Point Clouds", voting from an input point cloud to 3D bounding boxes and semantic classes), 3D instance segmentation (Jonas+, ICRA 2023), and PointNet-style classification, semantic segmentation, and part segmentation on raw point sets.]

4.

Representative tasks in 3D scene understanding
● 3D Visual Grounding (Chen+, ECCV 2020): given a scanned 3D scene and a free-form description of a specific object, localize the target object with a 3D bounding box (ScanRefer)
● 3D Question Answering (Azuma+, CVPR 2022): a model receives the entire 3D scene of a rich RGB-D indoor scan and answers textual questions about it, e.g., Q. "Where is the medium sized blue suitcase laid?" A. "in front of right bed" (ScanQA); unlike 2D VQA, models must handle object alignment, directions, and object localization in 3D
● Dense Captioning (Chen+, CVPR 2021): detect and describe objects in a 3D scan (Scan2Cap)
[Figure: task illustrations from ScanRefer and ScanQA.]

5.

An Embodied Generalist Agent in 3D World
Jiangyong Huang, Silong Yong, Xiaojian Ma, Xiongkun Linghu, Puhao Li, Yan Wang, Qing Li, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang
ICML 2024

6.
An Embodied Generalist Agent in 3D World (Huang+, ICML 2024)
● Develops LEO, a generalist agent that can carry out diverse tasks in 3D environments the way a human does
● Takes egocentric images, 3D point clouds, and text as input, and solves vision, language, and action tasks as an autoregressive sequence-prediction problem (see the sketch below)
● Demonstrates that applying instruction tuning to LEO yields strong performance across a wide range of tasks in the 3D world
[Figure 1: The proposed embodied generalist agent LEO. It takes egocentric 2D images, 3D point clouds, and texts as input: frozen 2D/3D encoders and a text tokenizer turn the inputs into tokens, a LoRA-adapted large language model predicts the output sequence, and de-tokenization yields text responses (e.g., "There is a sofa next to the TV.") or action responses (positions and rotations for manipulation, or moves such as "Turn right" for navigation). Tasks illustrated: scene captioning, 3D question answering, 3D dialogue, embodied navigation, robotic manipulation, task planning, 3D object captioning, and embodied reasoning.]
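To make the formulation concrete, here is a minimal sketch, assuming a causal LLM and frozen modality encoders, of how visual tokens can be prefixed to text tokens and trained with a next-token loss over the response only. All class and argument names are illustrative, not the authors' code.

```python
# Minimal sketch of LEO-style autoregressive sequence prediction.
# All names are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class GeneralistAgent(nn.Module):
    def __init__(self, llm, img_encoder, pc_encoder, d_model):
        super().__init__()
        self.llm = llm                  # causal LM mapping embeddings -> logits
        self.img_encoder = img_encoder  # frozen 2D encoder (e.g., CLIP-like)
        self.pc_encoder = pc_encoder    # frozen 3D point-cloud encoder
        self.img_proj = nn.Linear(img_encoder.out_dim, d_model)
        self.pc_proj = nn.Linear(pc_encoder.out_dim, d_model)

    def forward(self, image, points, text_embeds, labels):
        # Encode each modality into token embeddings of width d_model.
        img_tok = self.img_proj(self.img_encoder(image))  # (B, N_img, d)
        pc_tok = self.pc_proj(self.pc_encoder(points))    # (B, N_pc, d)
        # Prefix visual tokens to the embedded instruction + response text.
        inputs = torch.cat([img_tok, pc_tok, text_embeds], dim=1)
        logits = self.llm(inputs)                         # (B, T, vocab)
        # Next-token loss; labels carry -100 on prefix/instruction positions
        # so only response tokens (text or discretized actions) are scored.
        return nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            labels[:, 1:].reshape(-1),
            ignore_index=-100)
```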

7.
LEO's training method: 3D VL alignment
● Training has two stages:
  – 3D VL (vision-language) alignment, which ties 3D visual information to language
  – 3D VLA (vision-language-action) instruction tuning, which ties 3D vision, language, and action together
● Components highlighted for the alignment stage:
  – Datasets: ScanNet, 3RScan, Objaverse
  – 3D encoder: Mask3D, PointNet++, Spatial Transformer
[Figure 1 repeated, annotated with the alignment-stage datasets and 3D encoder choices.]

8.
LEO's training method: 3D VLA instruction tuning
● Training has two stages:
  – 3D VL (vision-language) alignment, which ties 3D visual information to language
  – 3D VLA (vision-language-action) instruction tuning, which ties 3D vision, language, and action together
● Components highlighted for the instruction-tuning stage:
  – 2D encoder: CLIP, with an Adapter
  – 3D encoder: Spatial Transformer
[Figure 1 repeated, annotated with the instruction-tuning-stage encoder choices; the LLM carries a LoRA module while the tokenizers stay frozen, as sketched below.]
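As a rough sketch of how this stage could be set up, the snippet below attaches LoRA adapters to a frozen LLM backbone using the Hugging Face transformers and peft libraries; the base checkpoint and hyperparameters are placeholders, not the paper's reported configuration.

```python
# Sketch of LoRA-based instruction tuning for the LLM backbone.
# Checkpoint and hyperparameters are assumptions for illustration.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

# Train only low-rank adapters on the attention projections, matching
# the frozen-backbone + LoRA setup sketched in Figure 1.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()  # only the adapter weights are trainable
```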

9.
Semi-automatic creation of instruction-tuning data
● The pipeline combines scene-graph-based prompting, object-centric Chain-of-Thought (O-CoT), and refinement
● Scene-Graph-based Prompting sends the LLM three kinds of messages (a prompt-assembly sketch follows this slide):
  1. System message: "You are an AI visual assistant in a 3D scene…"
  2. Demonstrations: a scene-graph context with human-labeled responses, e.g. {'sofa-1': {'attributes': {'color': 'red'}, 'relations': ['to the right of chair-2', 'in front of table-3']}, 'chair-2': {'attributes': {'color': 'brown'}, 'relations': []}}
  3. Query: the scene-graph context of a new scene
● Generated response types: 1) object scene caption, 2) scene caption, 3) dialogue (O-CoT), 4) QA (O-CoT), 5) planning
● Refinement procedures handle issues in the raw responses:
  1. Wrong answers → answer refinement (e.g., "There are two curtains in the room." → "There are 3 curtains in the room.")
  2. Negative responses (e.g., Answer: unknown) → removed
  3. Responses with object IDs → GPT-based rewriting (e.g., "The kitchen cabinet-9 and kitchen counter-10 are parts of the kitchen." → "The kitchen features a cabinet and a counter.")
  4. Others → unnecessary content (Thought, etc.) removed
[Figure 2: the LLM-assisted 3D-language data generation pipeline with example outputs — an O-CoT dialogue for "organize the bedroom", an object scene caption, O-CoT QA (e.g., Question: "Where is the printer located?" Thought: printer-8 Answer: "standing on the desk"), a scene caption, and a plan decomposing "Organize and tidy up the bedroom" into low-level actions.]
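A minimal sketch of assembling the scene-graph-based prompt, reusing the scene-graph dictionary format shown above; the message text and helper names are illustrative.

```python
# Sketch of scene-graph-based prompting: system message, one
# human-labeled demonstration, then the query for a new scene.
import json

SYSTEM = "You are an AI visual assistant in a 3D scene..."

def build_messages(demo_graph, demo_response, query_graph):
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user",
         "content": "Scene Graph Context:\n" + json.dumps(demo_graph)},
        {"role": "assistant", "content": demo_response},
        {"role": "user",
         "content": "Scene Graph Context:\n" + json.dumps(query_graph)},
    ]

demo_graph = {
    "sofa-1": {"attributes": {"color": "red"},
               "relations": ["to the right of chair-2",
                             "in front of table-3"]},
    "chair-2": {"attributes": {"color": "brown"}, "relations": []},
}
# Raw LLM responses would then pass through the refinement procedures
# above (answer refinement, dropping negatives, rewriting IDs,
# removing Thoughts) before being used as instruction-tuning data.
```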

10.
Semi-automatic creation of instruction-tuning data (cont.)
● The scene graphs that drive prompting come from 3DSSG (Wald+, CVPR 2020), whose representation includes hierarchical class labels per node, per-node attributes (shape, color, material, affordance, etc.), and relationship triplets between nodes (support, spatial, lexical relations)
[Figure 2 repeated, shown alongside the 3DSSG scene-graph representation: nodes are object instances such as ottoman:seat:furniture, sofa:seat:furniture, hand bag:item, and coffee table:table:furniture, each carrying attributes like shape, color, and texture.]

11.
Semi-automatic creation of instruction-tuning data (cont.)
● Combines scene-graph-based prompting, object-centric Chain-of-Thought, and refinement
[Figure 2 repeated, highlighting the refinement procedures: the human-defined procedures are run over raw LLM responses before the data is used in LEO-align and LEO-instruct, and the Thoughts (shown in gray) are removed from the final data.]
12.
Semi-automatic creation of instruction-tuning data (cont.)
● Combines scene-graph-based prompting, object-centric Chain-of-Thought, and refinement
[Figure 2 repeated.]

13.

Evaluation on 3D captioning, 3D question answering, and situated question answering
● LEO outperforms standard task-specific methods such as Scan2Cap (Chen+, CVPR 2021) and ScanQA (Azuma+, CVPR 2022)
● It also outperforms 3D vision foundation models fine-tuned per task, 3D-LLM (Hong+, NeurIPS 2023) and 3D-VisTA (Zhu+, ICCV 2023)
[Table 4: quantitative comparison with state-of-the-art models on 3D VL understanding and embodied reasoning — 3D captioning on Scan2Cap (val), 3D QA on ScanQA (val), and situated QA on SQA3D (test). "C" = CIDEr, "B-4" = BLEU-4, "M" = METEOR, "R" = ROUGE, "Sim" = sentence similarity, "EM@1" = top-1 exact match; n-gram metrics for Scan2Cap are computed at IoU@0.5. LEO scores 72.4 C / 38.2 B-4 on Scan2Cap, 101.4 C / 24.5 EM@1 on ScanQA, and 50.0 EM@1 on SQA3D, ahead of the task-specific and fine-tuned baselines.]

14.

Scaling analysis and human evaluation
● Test loss decreases as model size and data volume grow (Figure 3: LEO-instruct test loss with the growth of data and model scale, manifesting the scaling law; Scratch Vicuna-7B vs. Aligned Vicuna-7B/13B)
● Accompanying tables report TrueSkill scores under human preference (on dialogue and planning data) and answer accuracy (EM) on object-existence questions for 3RScan and ScanNet (0-shot), including ablations with augmented data (Aug)

15.

Qualitative evaluation: manipulation and navigation
● LEO can manipulate objects of unseen colors and object categories, and can navigate in unseen environments

16.

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding
Zhihao Yuan, Jinke Ren, Chun-Mei Feng, Hengshuang Zhao, Shuguang Cui, Zhen Li
CVPR 2024

17.

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding (Yuan+, CVPR 2024)
● 3D Visual Grounding (3DVG) is the task of localizing a specific object in a 3D scene from a textual description
● Conventional 3DVG methods are supervised: a model is trained on language descriptions annotated to objects (requiring extensive annotations and a predefined vocabulary) and predicts the location of the target object
● This work proposes a zero-shot, open-vocabulary 3DVG approach built on large language models, without such supervision
  – Visual programming, with view-independent, view-dependent, and functional modules, carries out the complex reasoning required in 3D scenes
  – A language-object correlation module extends existing 3D object detectors to open-vocabulary scenarios
[Figure 1: comparative overview of (a) supervised 3DVG, which learns from 3D scans, text queries, and object-text pair annotations, and (b) zero-shot 3DVG, which localizes the target from a programmatic representation generated by an LLM (target category, anchor category, relation grounding) — e.g., the keyboard is retrieved from its distance to the door.]

18.

Proposed method: dialog with an LLM, and 3D Visual Programming
● Dialog with LLM: object information from the 3D scene is given to the LLM as text, and the LLM infers the target object directly — e.g., given "Object 1 is a door located at (0.65, 2.35, 1.05). … Object 26 is a keyboard located at (-0.65, -1.06, 0.65).", it answers that Object 9 is the keyboard closest to the door (distance ≈ 2.01 vs. 3.44 units)
● 3D Visual Programming: with in-context learning, the LLM generates a program that is then executed to localize the target (an execution sketch follows this slide), e.g.:
  Description: The round cocktail table in the corner of the room with the blue and yellow poster
  Program:
    BOX0=LOC('round cocktail table')
    BOX1=LOC('blue and yellow poster')
    TARGET=CLOSEST(targets=BOX0, anchors=BOX1)
  Description: Staring at the cabinets you want the window on the right side
  Program:
    BOX0=LOC('window')
    BOX1=LOC('cabinet')
    TARGET=RIGHT(targets=BOX0, anchors=BOX1)
[Figure 2: overview of the two zero-shot approaches for 3DVG.]
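For intuition, here is a hedged sketch of executing such generated programs; LOC is assumed to return candidate boxes as (center, size) pairs, and the two relation modules are simplified stand-ins for the paper's view-independent and view-dependent modules, not their exact implementations.

```python
# Simplified executor for LLM-generated visual programs.
# Box representation and module logic are illustrative assumptions.
import numpy as np

def CLOSEST(targets, anchors):
    """View-independent module: pick the target box whose center is
    nearest to any anchor box center."""
    dists = [min(np.linalg.norm(t[0] - a[0]) for a in anchors)
             for t in targets]
    return targets[int(np.argmin(dists))]

def RIGHT(targets, anchors):
    """View-dependent module (simplified): assume a viewer at the
    origin looking toward the anchor; pick the rightmost target."""
    fwd = anchors[0][0] / (np.linalg.norm(anchors[0][0]) + 1e-8)
    right = np.cross(fwd, np.array([0.0, 0.0, 1.0]))  # horizontal right axis
    scores = [np.dot(t[0] - anchors[0][0], right) for t in targets]
    return targets[int(np.argmax(scores))]

# Executing the second example program above:
#   BOX0 = LOC('window'); BOX1 = LOC('cabinet')
#   TARGET = RIGHT(targets=BOX0, anchors=BOX1)
```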

19.

LOC module: language-object correlation
● A function that returns a list of bounding boxes (object positions and sizes) matching a text phrase
● A closed-vocabulary instance segmenter first splits the 3D scene into object segments; each segment is rendered to a 2D image, and 2D multi-modal models judge from the image-text match whether it is the desired object — via image classification, question answering, or a general large model such as BLIP-2 (e.g., Q: "Is there a round cocktail table?" A: "Yes, it is a round cocktail table."); a sketch follows below
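A sketch of what the LOC module could look like with CLIP as the 2D matcher (a VQA model or BLIP-2 would slot in the same way); the segmenter, the renderer, and the similarity threshold are assumptions, not the paper's exact pipeline.

```python
# Hypothetical LOC implementation with CLIP as the 2D matcher.
# Segmenter, renderer, and the 0.25 threshold are assumptions.
import torch
import clip  # OpenAI CLIP; BLIP-2 or a VQA model could be swapped in

model, preprocess = clip.load("ViT-B/32")

def LOC(phrase, segments, render):
    """Return 3D boxes of instance segments whose rendered 2D crop
    matches `phrase`; `render` projects a segment to an image."""
    matched = []
    with torch.no_grad():
        text_feat = model.encode_text(clip.tokenize([phrase]))
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        for seg in segments:
            img = preprocess(render(seg)).unsqueeze(0)
            img_feat = model.encode_image(img)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            if (img_feat @ text_feat.T).item() > 0.25:  # similarity gate
                matched.append(seg.box)
    return matched
```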

20.

Quantitative evaluation on a 3D visual grounding dataset
● Despite using no supervision, the proposed method is competitive with the early supervised 3DVG method ScanRefer (Chen+, ECCV 2020): on the ScanRefer validation set it reaches 36.4 / 32.7 overall Acc@0.25 / Acc@0.5, versus 37.3 / 24.3 for supervised ScanRefer
● It clearly outperforms other zero-shot open-vocabulary methods (overall Acc@0.25: LERF 4.8, OpenScene 13.2)
● Visual programming beats the dialog-with-LLM formulation in both accuracy and cost (Table 4):

  Method   LLM       Acc@0.25  Tokens  Cost
  Dialog   GPT-3.5   25.4      1959k   $3.05
  Dialog   GPT-4     27.5      1916k   $62.6
  Program  GPT-3.5   32.1      121k    $0.19
  Program  GPT-4     35.4      115k    $4.24

● The LOC module improves when the large vision-language model BLIP-2 is used as the 2D matcher (overall accuracy 36.4 vs. 35.7 with CLIP and 35.1 with ViLT); further ablations cover view-dependent modules (LEFT, RIGHT, FRONT, BEHIND, BETWEEN), view-independent modules (CLOSEST, FARTHEST, LOWER, HIGHER), and 3D backbones (PointNet++, PointBERT, PointNeXt)

21.

Qualitative evaluation on a 3D visual grounding dataset
● Compared with existing methods, the approach is strong on queries that require view-dependent relations and an open vocabulary
● It is weak on queries whose patterns differ from the examples used for in-context learning
[Qualitative examples comparing Ground Truth, BUTD-DETR, Ours (Dialog), and Ours (Program) on queries such as "It is a window. It is located above a recycle bin that has a blue top." and "A desk chair is pushed into a small computer desk. The chair has wheels."]

22.

Agent3D-Zero: An Agent for Zero-shot 3D Understanding
Sha Zhang, Di Huang, Jiajun Deng, Shixiang Tang, Wanli Ouyang, Tong He, Yanyong Zhang
ECCV 2024

23.

Agent3D-Zero: An Agent for Zero-shot 3D Understanding (ECCV 2024)
● Proposes Agent3D-Zero, an agent framework for understanding 3D scenes zero-shot with a large vision-language model (VLM)
● The prior approach, 3D-LLM (Hong+, NeurIPS 2023), reads 3D data into a 3D perceiver and feeds its output to an LLM to answer questions about the 3D scene
● Agent3D-Zero instead redefines 3D scene understanding as the process of integratively interpreting images from multiple viewpoints, as a human would, and uses the VLM both for viewpoint selection and for the vision-language task itself
[Figure: (a) the fine-tuning paradigm — a vision encoder and 3D perceiver feeding a large language model, trained on 3D scans — vs. (b) the zero-shot framework, in which a VLM selects observing views of the 3D scans and then performs language-involved 3D understanding.]

24.

Overview of Agent3D-Zero
● Proposes Set-of-Line Prompting (SoLP), which generates camera positions that help the VLM understand the spatial relations of a scene more deeply (an overlay sketch follows this slide)
● Grid lines and tick marks are superimposed on a bird's-eye-view (BEV) image of the 3D scene, and a path of 18 camera poses is created for observing the scene and the objects in it (e.g., "Here is the BEV image of a bedroom. With a limit of 18 camera poses to observe the 3D scene and objects in it, give me the best choice.")
● The images rendered at the selected camera poses are passed to the VLM to solve 3D question answering, semantic segmentation, scene captioning, and similar tasks
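A minimal sketch of the grid-and-tick overlay behind Set-of-Line Prompting, assuming a rendered BEV image; line spacing, color, and labels are illustrative rather than the paper's exact rendering parameters.

```python
# Sketch of a Set-of-Line-style overlay on a BEV image.
# Spacing, color, and labels are illustrative assumptions.
from PIL import Image, ImageDraw

def set_of_line_prompt(bev: Image.Image, step: int = 50) -> Image.Image:
    """Superimpose grid lines and axis tick labels on a bird's-eye-view
    image so a VLM can reference positions when proposing camera poses."""
    out = bev.copy()
    draw = ImageDraw.Draw(out)
    w, h = out.size
    for x in range(0, w, step):
        draw.line([(x, 0), (x, h)], fill="red", width=1)
        draw.text((x + 2, 2), str(x), fill="red")
    for y in range(0, h, step):
        draw.line([(0, y), (w, y)], fill="red", width=1)
        draw.text((2, y + 2), str(y), fill="red")
    return out

# The annotated BEV is then sent to the VLM with a request such as:
# "With a limit of 18 camera poses to observe the 3D scene and the
#  objects in it, give me the best choice."
```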

25.

Evaluation on 3D understanding tasks: 3D question answering
● Evaluated on the validation set of the 3D question answering dataset ScanQA (Azuma+, CVPR 2022)
● Although training-free, Agent3D-Zero (using GPT-4V) surpasses two-stage supervised methods (object detection + QA) and fine-tuned 3D foundation models on some metrics (METEOR, ROUGE-L, CIDEr)
● Selecting camera views from the BEV image with the VLM improves performance substantially over choosing views at random

Table 1: Performance comparison on the ScanQA validation set. "Two-stage" means the models use explicit object representations; "Fine-tune" means extra training; Agent3D-Zero is training-free. B-1 and B-4 denote BLEU-1 and BLEU-4.

             Method                    B-1   B-4   METEOR  ROUGE-L  CIDEr  EM
  Two-stage  VoteNet+MCAN              28.0  6.2   11.4    29.8     54.7   17.3
             ScanRefer+MCAN            26.9  7.9   11.5    30.0     55.4   18.6
             ScanQA                    30.2  10.1  13.1    33.3     64.9   21.0
  Fine-tune  flamingo-SingleImage      23.8  8.5   10.7    29.6     52.0   16.9
             flamingo-MultiView        25.6  8.4   11.3    31.1     55.0   18.8
             BLIP2-flant5-SingleImage  28.6  5.1   10.6    25.8     42.6   13.3
             BLIP2-flant5-MultiView    29.7  5.9   11.3    26.6     45.7   13.6
             3D-LLM (flamingo)         30.3  7.2   12.2    32.3     59.2   20.4
             3D-LLM (BLIP2-opt)        35.9  9.4   13.8    34.0     63.8   19.3
             3D-LLM (BLIP2-flant5)     39.3  12.0  14.5    35.7     69.4   20.5
  Zero-shot  LLaVA-SingleImage         7.1   0.3   10.5    12.3     5.7    0.0
             Agent3D-Zero (random)     16.4  2.1   12.2    26.9     40.0   4.9
             Agent3D-Zero (selected)   28.6  4.4   16.0    37.0     71.8   17.5

26.

Evaluation on 3D understanding tasks: dialog, task decomposition, and captioning
● Camera views selected via BEV + VLM improve performance over random views on every task
● Compared with the fine-tuned 3D-LLM, Agent3D-Zero is on par for 3D-assisted dialog and clearly better for task decomposition

Table 3: Performance comparison on the Held-In dataset introduced in 3D-LLM [20].

  Task                Model                     BLEU-1  BLEU-4  METEOR  ROUGE-L
  3D-assisted Dialog  flant5                    27.4    8.7     9.5     27.5
                      flamingo-SingleImage      29.4    9.4     10.0    26.8
                      flamingo-MultiView        30.6    9.1     10.4    27.9
                      BLIP2-flant5-SingleImage  28.4    9.1     10.2    27.4
                      BLIP2-flant5-MultiView    32.4    9.5     11.0    29.5
                      3D-LLM (flamingo)         35.0    10.6    16.0    34.2
                      3D-LLM (BLIP2-opt)        39.6    16.2    18.4    38.6
                      3D-LLM (BLIP2-flant5)     39.0    16.6    18.9    39.3
                      Agent3D-Zero (random)     26.9    7.1     17.2    30.9
                      Agent3D-Zero (selected)   32.8    9.8     19.3    39.3
  Task Decomposition  flant5                    25.5    6.0     13.9    28.4
                      flamingo-SingleImage      31.4    7.1     15.6    30.6
                      flamingo-MultiView        33.1    7.3     16.1    33.2
                      BLIP2-flant5-SingleImage  32.2    6.9     15.0    31.0
                      BLIP2-flant5-MultiView    33.1    6.9     15.5    34.0
                      3D-LLM (flamingo)         32.9    6.4     16.0    33.5
                      3D-LLM (BLIP2-opt)        34.1    7.6     16.5    35.4
                      3D-LLM (BLIP2-flant5)     33.9    7.4     15.9    37.8
                      Agent3D-Zero (random)     33.8    6.7     16.7    36.6
                      Agent3D-Zero (selected)   42.0    15.5    22.9    45.1
  3D Captioning       Agent3D-Zero (random)     26.1    1.0     13.9    14.3
                      Agent3D-Zero (selected)   29.5    7.2     15.9    16.1

27.
Qualitative evaluation on 3D understanding tasks: captioning and task decomposition
● 3D Scene Caption produced by Agent3D-Zero from a 3D scan of a kitchen and the selected views (Fig. 3): "The scene is a compact and utilitarian kitchen space. The kitchen is equipped with essential appliances including a white refrigerator with a top freezer compartment, and a white electric range oven with coil burners and control knobs at the front. There is a corner kitchen counter that provides a modest amount of workspace, flanked on one side by the range and on the other by a stainless steel sink with a faucet. Above the counter, there are white cabinets offering storage for kitchen essentials. … Overall, the kitchen is modest but seems well-organized."
● Task Decomposition for "prepare a meal in the kitchen": 1. go to the kitchen area 2. open the kitchen cabinets 3. take out ingredients and utensils from the kitchen cabinets 4. go to the refrigerator 5. open the refrigerator 6. take out items from the refrigerator 7. go to the kitchen counter 8. wash vegetables and prepare them on the counter 9. go to the sink 10. wash hands in the sink 11. go to the stove 12. turn on the stove 13. cook the meal on the stove 14. turn off the stove 15. go to the table
● Real-world action-planning experiment (Fig. 4, navigation in an office): given "Please bring me the document printed on the printer in the southeast corner," GPT-4V repeatedly observes the environment (west, east, north, and south views) and issues instructions — "Go east." → "Since the printer is in the southeast and there is no printer in the east, go south." → "Keep going south" → "There is no printer in the south direction. Go east next." → "Bingo! Find the printer. It is in the south corner." A loop-style reading of this behavior is sketched below.
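The navigation behavior can be read as a simple observe-query-act loop; the sketch below is a hypothetical reconstruction with placeholder callables, not the authors' implementation.

```python
# Hypothetical observe-query-act loop for the Fig. 4 navigation demo.
# `capture_view`, `query_vlm`, and `move` are placeholder callables.
GOAL = ("Please bring me the document printed on the printer "
        "in the southeast corner.")

def navigate(capture_view, query_vlm, move, max_steps=10):
    history = []
    for _ in range(max_steps):
        # Observe the environment: west, east, north, and south.
        views = {d: capture_view(d)
                 for d in ("west", "east", "north", "south")}
        instruction = query_vlm(
            images=list(views.values()),
            prompt=(f"Goal: {GOAL}\nPrevious actions: {history}\n"
                    "Which direction should I go next, "
                    "or is the target visible?"))
        history.append(instruction)
        if "find the printer" in instruction.lower():
            return history  # e.g., "Bingo! Find the printer."
        move(instruction)   # e.g., "Go east." / "Keep going south"
    return history
```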

28.

Coarse Correspondences Elicit 3D Spacetime Understanding in Multimodal Language Model
Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, Ranjay Krishna
arXiv 2024

29.

Coarse Correspondences Elicit 3D Spacetime Understanding in Multimodal Language Model (Liu+, arXiv 2024)
● Proposes a visual prompt that is effective for 3D scene understanding and long-video tasks (a construction sketch follows this slide):
  – a lightweight tracking model finds object correspondences across video frames, or across viewpoints of images
  – the most frequently appearing objects are selected, and each is given a marker carrying a unique ID
[Figure 1: (a) track objects in high-frame-rate videos, (b) construct coarse correspondences on sparsified views, (c) help MLLMs understand 3D space-time with the prompted images.]
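The marker construction can be sketched as follows, assuming tracker output is already available; the marker style, ID scheme, and top-k choice are illustrative assumptions rather than the paper's exact design.

```python
# Hypothetical sketch of Coarse Correspondences: keep the most
# frequently tracked instances and stamp a shared numeric ID on
# each occurrence across the sparsified frames.
from collections import Counter
from PIL import Image, ImageDraw

def coarse_correspondence(frames, tracks, top_k=5):
    """`tracks` maps frame index -> list of (instance_id, (x, y))
    centers from a lightweight video tracker run at high frame rate."""
    counts = Counter(iid for dets in tracks.values() for iid, _ in dets)
    keep = {iid for iid, _ in counts.most_common(top_k)}
    marked = []
    for i, frame in enumerate(frames):
        img = frame.copy()
        draw = ImageDraw.Draw(img)
        for iid, (x, y) in tracks.get(i, []):
            if iid in keep:  # same object gets the same ID in every view
                draw.ellipse([x - 12, y - 12, x + 12, y + 12], fill="red")
                draw.text((x - 4, y - 6), str(iid), fill="white")
        marked.append(img)
    return marked  # prompted images handed to the multimodal LLM
```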

30.

Does Coarse Correspondences help 3D spatial understanding?
● 3D spatial understanding is evaluated on the ScanQA 3D question answering dataset
● Adding the visual prompt (+CC) improves Gemini, Claude, GPT-4V, and GPT-4O; GPT-4V+CC rivals the fine-tuned 3D-LLM

Table 1: Comparison on the ScanQA validation set (following 3D-LLM).

  Method       Frames  B-1   B-2   METEOR  ROUGE-L  CIDEr
  LLaVA        -       7.1   2.6   10.5    12.3     5.7
  Flamingo     -       25.6  15.2  11.3    31.1     55.0
  BLIP2        -       29.7  16.2  11.3    26.6     45.7
  3D-LLM       -       39.3  25.2  14.5    35.7     69.4
  Gemini       8       24.1  13.5  11.3    35.4     68.3
  Gemini+CC    8       25.4  15.7  12.0    37.1     75.5
  Claude       8       19.8  11.1  10.0    29.3     57.7
  Claude+CC    8       27.1  23.9  11.7    33.1     65.7
  GPT-4V       8       28.6  13.4  13.5    33.4     59.6
  GPT-4V+CC    8       39.7  25.5  17.4    40.8     79.2
  GPT-4O       4       30.5  19.8  14.8    36.1     72.2
  GPT-4O+CC    4       35.4  25.5  18.0    42.6     87.0

31.

Does Coarse Correspondences help temporal understanding?
● Long-video understanding is evaluated on the egocentric video question answering dataset EgoSchema (Mangalam+, NeurIPS 2023); baselines include language-only models (LLaMA2, to account for language bias), general-purpose multimodal LLMs (GPT-4, Claude3, Gemini-Pro), and GPT-4V with more frames
● Coarse Correspondences yields a modest but consistent gain: with only 8 frames, GPT-4V improves from 64.2 to 67.4 and GPT-4O from 67.2 to 73.2 on the validation subset, and GPT-4O+CC surpasses the latest methods (e.g., LangRepo 66.2) without using any training data
[Table 2 reports comparisons on the EM-EQA setting of OpenEQA, where the method further improves results; Table 3 reports EgoSchema validation accuracy against LongViviT (256 frames, 56.8), MC-ViT-L (128+, 62.6), LLoVi (180, 58.3), VideoAgent (60.2), MVU (16, 60.3), and LangRepo (66.2). The figure shows EgoSchema examples, e.g., "What is the overarching behavior of C and the man in the video?", with frames sampled from t = 0s to t = 180s.]

32.
Evaluating Coarse Correspondences with a Spatial Orientation Test (SOT)
● In cognitive science, the spatial orientation test is a widely adopted examination of spatial intelligence; it assesses spatial perspective-taking — the ability to imagine how an object or scene would appear from a viewpoint different from one's own physical position
● Spatial perspective-taking is closely tied to the development of spatial intelligence in children (Tversky+, Cognitive 2009)
● Introducing Coarse Correspondences substantially improves the spatial perspective-taking ability of VLMs, and eases the camera-motion bias of current MLLMs
● Example question pair about two views of a building entrance and an elevator: (A) observer-perspective understanding — "From the observer's perspective, on which side of the elevator is the building's main entrance?" — and (B) spatial perspective taking — "If Frank has just entered the building through the main entrance, on which side is the elevator from Frank's perspective? Please answer from Frank's perspective, not the observer's."

Table 4: Comparison on SOT.

  Model       Frames  Origin  Reverse  Harmonic Mean
  GPT-4O      2       58.2    50.0     53.8
  GPT-4O+CC   2       71.6    70.6     71.1
  GPT-4O      4       58.0    50.4     53.9
  GPT-4O+CC   4       71.2    71.2     71.2

[Bar chart: accuracy on perspective-taking questions for GPT-4O without CC, with CC, and random guess, at 2 and 4 frames.]

33.

Comparison with conventional prompts
● Compared with prior visual prompting methods (Set-of-Mark, 3DAxiesPrompts, Chain-of-Thought), Coarse Correspondences succeeds in eliciting an understanding of 3D spatial relations in GPT-4V and leading it to the correct answer
● Example: "You are sitting on the sofa and the electric fan is on your left. Describe the location of the room door from your perspective: A. front left; B. front right; C. back left; D. back right." With Coarse Correspondences, GPT-4V links the marked objects across the two views, infers that the door and the fan are on the same side, and answers A (front left) correctly; Set-of-Mark, 3DAxiesPrompts, and Chain-of-Thought prompting all lead to wrong answers

34.

Summary
● Introduced foundation models for 3D scene understanding:
  – Fine-tuning an LLM on semi-automatically generated instruction-tuning data (LEO)
  – Solving 3D visual grounding few-shot with visual programming
  – Solving 3D scene-understanding tasks zero-shot by generating camera poses from bird's-eye-view images (Agent3D-Zero)
  – Raising 3D scene-understanding ability with a prompt that expresses object correspondences across frames (Coarse Correspondences)
● Impressions:
  – Research that uses LLMs to semi-automatically create 3D scene-understanding data has become common recently
  – For 3D visual tasks, visual prompting alone may remain competitive as long as it is effective
  – Attempts to understand 3D scenes by integratively interpreting images from multiple viewpoints, as humans do, will likely continue
  – A foundation model that can understand video and 3D space simultaneously may be needed?