Perspectives on World Models and Predictive Coding in Cognitive Robotics

October 11, 23

Slide overview

Material for the invited talk at the IROS 2023 workshop on "World Models and Predictive Coding in Cognitive Robotics"

I do research on artificial intelligence, machine learning, deep learning, and related topics.

Text of each page
1.

Perspectives on World Models and Predictive Coding in Cognitive Robotics Masahiro Suzuki The University of Tokyo 10/4/2023 (updated on 10/11/2023) 1

2.

Outline ¤ World models and deep generative models ¤ Predictive coding, free energy principle, and active inference ¤ Challenges 2

3.

Our survey paper 3

4.

The advancement of artificial intelligence based on deep learning ¤ Foundation models and large language models (LLMs) ¤ By learning from a vast amount of data over a long time, they perform well across a wide range of tasks. [Rombach+ 22] ¤ Challenges in realizing autonomous intelligence in the real world using these models. ¤ They learn to map inputs to outputs. ¤ There is a need for humans to curate large amounts of data in advance. 4

5.

Learn to map inputs to outputs ¤ Foundation models predict the output when given an instruction (prompt). ¤ They rely on more detailed instructions from humans. -> The intelligence serves as a tool rather than being autonomous. Input: What is a world model? -> LLM -> Output: A “world model” can be understood in multiple contexts... ¤ Humans, in contrast, interact with the environment to understand the structure of the world (the consequences of actions). 5

6.

Need for humans to curate large amounts of data in advance ¤ Current foundation models need a lot of data for learning. ¤ In the realm of natural language, vast amounts of data can be obtained from sources like the internet. ¤ Humans don't need as much data as foundation models. ¤ It’s impossible to learn from all real-world data. ¤ Rather than relying on the "passive" methods of LLMs, intelligent agents should "actively" acquire data from their external environment. => The importance of world models and the free energy principle (predictive coding). 6

7.

World Models ¤ Humans cannot perceive all aspects of the world. ¤ The amount of information that our brain receives is limited. ¤ The brain constructs models of the real world using this limited information. ¤ World model: ¤ A model that learns to approximate the structure of the world based on limited observations from the external environment. ¤ This model infers underlying “causes” from its observations, then predicts (or generates) future outcomes and previously unknown observations based on these inferred causes. World model Environment Approximate Infer Observation Predict Cause 7

8.

Background of world models ¤ Helmholtz’s theory of unconscious inference: ¤ Humans continuously interpret external stimuli using an internal representation (world model). ¤ This process of interpretation relies on inductive inference. ¤ Infer the underlying "cause" when presented with an image, which is the "result." Infer Result (image) Room Cause 8

9.

Background of world models ¤ Internal models in the cerebellum ¤ Humans control motion based on internal models. ¤ The process involves learning through feedback errors [Kawato+ 87, Kawato+ 92]. ¤ Internal models in control ¤ Modeling is also done using the Kalman filter [Wolpert+ 95]. [Rao 99] 9

10.

Representation learning in world models ¤ In the brain, information from the environment is processed and compressed into both spatial and temporal representations. ¤ Example: a person riding a bicycle compresses the representation of "riding a bicycle" into a spatial and temporal representation. World model Infer [McCloud 93][Ha+ 18] Predict Cause ¤ This corresponds to hierarchical inference of representations. ¤ The world model performs representation learning through inference. 10

11.

Prediction by world models ¤ The learned world model is used to simulate the future. ¤ It is believed that humans are constantly doing this. ¤ Example: hitting a ball with a bat ¤ The interval between the ball's launch and its contact with the bat is shorter than the time required for visual information to travel to the brain, process it, and then determine the bat's swing and muscle movements. ¤ We unconsciously make predictions based on our internal world model and move our muscles to match those anticipations. World model Infer Predict Cause [McCloud 93][Ha+ 18] 11

12.

Definition of world models ¤ “A self-supervised predictive model of how the world evolves, both due to its intrinsic dynamics and your actions” [Kim+ 20] ¤ Given an agent’s action $a_t$ at observation $x_t$ at time $t$, what subsequent observation $x_{t+1}$ and internal state $z_{t+1}$ will be obtained? [Taniguchi+ 22] 12

13.

Advantages of World Models ¤ Acquiring a world model that incorporates one's own actions allows for dynamic predictions (as a simulator). ¤ We can move freely in space and time within our imagination. ¤ Forward prediction in time aids effective planning. ¤ For robots, mastering the world model allows them to learn without physically moving. ¤ State representation learning enables the extraction of the world's state representation from observations. ¤ If a compact state representation is obtained, it's possible to predict the future better than just based on observations. ¤ The world can be made differentiable. ¤ “Making the World Differentiable” [Schmidhuber 1990] ¤ While the world is inherently a black box, leveraging a differentiable world model facilitates efficient policy learning. 13

14.

Challenges of World Models ¤ The world is vast, so it's difficult to model everything. ¤ Instead of predicting "how the world transitions", predict "what happens when an agent interacts with the world". ¤ It's impossible to observe everything in the world. ¤ This leads to a challenge of dealing with partial observations. ¤ Uncertainty is inherent in the world. ¤ Models must be designed to account for this uncertainty. 14

15.

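Modeling Methods for World Models ¤ Generative world models (variational world models): ¤ Learning with generative models that have latent state variables (Partially Observable Markov Decision Process; POMDP). [Graphical model: actions $\mathbf{a}_{t-2}, \mathbf{a}_{t-1}, \mathbf{a}_t$; latent states $\mathbf{z}_{t-1}, \mathbf{z}_t, \mathbf{z}_{t+1}$; observations $\mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}$] ¤ Self-supervised transition models: ¤ Learning in a self-supervised manner to predict the subsequent observation. [Diagram: $(\mathbf{x}_t, \mathbf{a}_t)$ -> transition model -> $\mathbf{x}_{t+1}$] 15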

16.

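Modeling methods for world models ¤ Generative world models (variational world models): ¤ Learning with generative models that have latent state variables (Partially Observable Markov Decision Process; POMDP). [Graphical model: actions $\mathbf{a}_{t-2}, \mathbf{a}_{t-1}, \mathbf{a}_t$; latent states $\mathbf{z}_{t-1}, \mathbf{z}_t, \mathbf{z}_{t+1}$; observations $\mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}$] ¤ Self-supervised transition models: ¤ Learning in a self-supervised manner to predict the subsequent observation. [Diagram: $(\mathbf{x}_t, \mathbf{a}_t)$ -> transition model -> $\mathbf{x}_{t+1}$] 16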
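The self-supervised transition model described on this slide can be illustrated with a minimal sketch. The environment `true_environment_step`, its coefficients, and the linear model form are all hypothetical choices made for illustration; they are not taken from the talk. The model only sees logged tuples $(x_t, a_t, x_{t+1})$ and uses the next observation itself as the training target.

```python
import random


def true_environment_step(x, a):
    """Hypothetical environment dynamics, unknown to the model."""
    return 0.9 * x + 0.5 * a


def collect_transitions(num_steps, seed=0):
    """Act randomly in the environment and log (x_t, a_t, x_{t+1}) tuples."""
    rng = random.Random(seed)
    x = 0.0
    transitions = []
    for _ in range(num_steps):
        a = rng.uniform(-1.0, 1.0)
        x_next = true_environment_step(x, a)
        transitions.append((x, a, x_next))
        x = x_next
    return transitions


def fit_linear_transition(transitions):
    """Least-squares fit of x_{t+1} ~ w_x * x_t + w_a * a_t (2x2 normal equations)."""
    sxx = sum(x * x for x, _, _ in transitions)
    sxa = sum(x * a for x, a, _ in transitions)
    saa = sum(a * a for _, a, _ in transitions)
    sxy = sum(x * y for x, _, y in transitions)
    say = sum(a * y for _, a, y in transitions)
    det = sxx * saa - sxa * sxa
    w_x = (sxy * saa - say * sxa) / det
    w_a = (say * sxx - sxy * sxa) / det
    return w_x, w_a


transitions = collect_transitions(200)
w_x, w_a = fit_linear_transition(transitions)
# The learned model can now predict without touching the environment.
predicted_next = w_x * 1.0 + w_a * (-1.0)
```

No labels beyond the agent's own experience are needed, which is what "self-supervised" refers to here. A practical world model would replace the linear fit with a neural network and handle noise and partial observability.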

17.

Generative Models
¤ A framework that assumes that observed data are generated from an unknown data distribution and models the generation process using probability distributions.
¤ In addition to observed variables that are observed as data, latent variables are often assumed as random variables underlying the observed variables (latent variable models).
¤ It is possible to explicitly design "how the data is generated" and generate (simulate) data from the model.
[Figure: the data distribution $p_{data}(\mathbf{x})$ generates the observed data $\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^{N}$; the generative model with parameter $\theta$ approximates it as $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$, generating via $\mathbf{z} \sim p(\mathbf{z})$, $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$, where $\mathbf{z}$ is the latent variable and $\mathbf{x}$ the observed variable.]
17
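The generation process $\mathbf{z} \sim p(\mathbf{z})$, $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$ can be sketched as ancestral sampling. The standard normal prior, the linear-Gaussian likelihood, and the parameter values (`weight`, `bias`, `noise_std`) are assumptions chosen only to keep the example small.

```python
import random


def sample_latent(rng):
    """Draw z ~ p(z); a standard normal prior is assumed here."""
    return rng.gauss(0.0, 1.0)


def sample_observation(z, rng, weight=2.0, bias=1.0, noise_std=0.1):
    """Draw x ~ p_theta(x | z); theta = (weight, bias, noise_std) is hypothetical."""
    return rng.gauss(weight * z + bias, noise_std)


def generate(num_samples, seed=0):
    """Simulate data from the model by sampling z first, then x given z."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        z = sample_latent(rng)
        x = sample_observation(z, rng)
        samples.append((z, x))
    return samples


samples = generate(1000)
mean_x = sum(x for _, x in samples) / len(samples)  # close to the bias term
```

Only the `x` values would be observed in practice; the `z` values stay hidden, which is why inference over `z` is needed.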

18.

Inference in generative models ¤ Inference: ¤ Obtaining the posterior of the latent variable given the observed variable. ¤ This is an important concept in generative models with latent variables (deducing the cause from the result). [Diagram: inference $p_\theta(\mathbf{z} \mid \mathbf{x})$ runs from $\mathbf{x}$ to $\mathbf{z}$, opposite to generation $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$] ¤ In general models, inference is often computationally difficult. ¤ Various approximate inference methods have been proposed. In this presentation, we distinguish between "inference," which means obtaining the posterior of latent variables, and "estimation" or "learning," which means optimizing the parameter values of the model. 18

19.

Deep generative models ¤ When the observed variables are complex, the generative process cannot be directly represented by a simple probability distribution. How to represent complex relationships? -> deep neural networks (DNNs) ¤ Deep generative models (DGMs): ¤ Generative models that represent probability distributions using DNNs ($\mathbf{z} \sim p(\mathbf{z})$, $\mathbf{x} \sim p_\theta(\mathbf{x} \mid \mathbf{z})$). ¤ Model parameters are learned based on gradient information. DGMs combine the ability of generative models to explicitly model the generative process with the ability of DNNs to capture complex relationships between variables. 19

20.

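Variational Autoencoder
¤ Variational autoencoder (VAE) [Kingma+ 13, Rezende+ 14].
¤ DNN representation of the probability distributions of a latent variable model (deep latent variable model).
Inference model (amortized variational inference): $q_\phi(\mathbf{z} \mid \mathbf{x}) = \mathcal{N}(\mathbf{z} \mid \boldsymbol{\mu} = g_\phi^{\mu}(\mathbf{x}), \boldsymbol{\sigma} = g_\phi^{\sigma}(\mathbf{x}))$
Prior: $p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})$
Generative model: $p_\theta(\mathbf{x} \mid \mathbf{z}) = \mathcal{B}(\mathbf{x} \mid \boldsymbol{\lambda} = f_\theta(\mathbf{z}))$
Precisely, $p_\theta(\mathbf{x}, \mathbf{z}) = p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})$ represents a generative model, but by convention $p_\theta(\mathbf{x} \mid \mathbf{z})$ is referred to as the generative model.
20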

21.

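Variational Autoencoder
¤ Objective: evidence lower bound (ELBO), or negative variational free energy
$\log p_\theta(\mathbf{x}) \geq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{KL}[q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z})] = -\mathcal{F}(\mathbf{x})$
First term: prediction error. Second term: regularization. $\mathcal{F}(\mathbf{x})$: variational free energy.
¤ The inference model and the generative model can be regarded as an encoder and a decoder in an autoencoder.
Input $\mathbf{x}$ -> Encoder $q_\phi(\mathbf{z} \mid \mathbf{x})$ -> $\mathbf{z}$ -> Decoder $p_\theta(\mathbf{x} \mid \mathbf{z})$ -> Reconstruction $\mathbf{x}$
21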
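The two terms of the ELBO can be made concrete with a single-sample estimate for a diagonal Gaussian inference model, a standard normal prior, and a Bernoulli decoder. The `toy_encoder` and `toy_decoder` below are fixed, hand-written stand-ins for trained networks and are purely hypothetical; only the arithmetic of the objective is the point.

```python
import math
import random


def gaussian_kl_to_standard_normal(mu, sigma):
    """Closed-form KL[N(mu, diag(sigma^2)) || N(0, I)] (the regularization term)."""
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))


def bernoulli_log_likelihood(x, lam, eps=1e-7):
    """log p(x | z) for binary x under Bernoulli means lam (the reconstruction term)."""
    return sum(
        xi * math.log(li + eps) + (1 - xi) * math.log(1.0 - li + eps)
        for xi, li in zip(x, lam)
    )


def elbo_estimate(x, encoder, decoder, rng):
    """One-sample ELBO estimate using the reparameterization z = mu + sigma * noise."""
    mu, sigma = encoder(x)
    z = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    reconstruction = bernoulli_log_likelihood(x, decoder(z))
    kl = gaussian_kl_to_standard_normal(mu, sigma)
    return reconstruction - kl, reconstruction, kl


def toy_encoder(x):
    """Hypothetical encoder: a 1-D latent whose mean depends on the pixel average."""
    return [sum(x) / len(x) - 0.5], [0.5]


def toy_decoder(z):
    """Hypothetical decoder: every pixel shares one Bernoulli mean, sigmoid(z)."""
    return [1.0 / (1.0 + math.exp(-z[0]))] * 4


rng = random.Random(0)
x = [1, 0, 1, 1]
elbo, reconstruction, kl = elbo_estimate(x, toy_encoder, toy_decoder, rng)
free_energy = -elbo  # variational free energy F(x)
```

Training a VAE amounts to adjusting the encoder and decoder parameters to raise this quantity, averaged over data and noise samples.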

22.

Generated images ¤ Sampling image 𝐱 from the decoder based on random 𝐳 ¤ Generated images are similar to the dataset, but tend to have blurred contours. [Kingma+ 13] @AlecRad 22

23.

Generated images ¤ Nouveau VAE (NVAE) [Vahdat+ 20] ¤ Hierarchize the latent variables in VAE. ¤ Advantages: ¤ Acquiring hierarchical representations. ¤ Improving the expressive power of the entire model. ¤ Enabling more flexible inference. 23

24.

VAE and representation learning ¤ In deep generative models, representation learning is equivalent to inference $\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})$. ¤ It acquires the representation by inferring from the input to the latent variable. ¤ Representation learning: ¤ Acquiring a "good representation" from data (ideally unsupervised). ¤ Good representation: a representation that retains some of the properties of the original data and can be reused for other tasks. ¤ Meta-Prior [Bengio+ 13, Goodfellow+ 16]: ¤ Refers to assumptions about the properties of representations that can be used for many tasks. ¤ These include manifold structure, disentanglement, a hierarchy of concepts, semi-supervised learning, cluster properties, and more. [Figure source: https://www.slideshare.net/lubaelliott/emily-denton-unsupervised-learningof-disentangled-representations-from-video-creative-ai-meetup] 24

25.

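Conditional VAEs ¤ Deep generative models conditioned on the observed variable $\mathbf{y}$ (information different from $\mathbf{x}$) ¤ They represent the generative process from $\mathbf{y}$ to $\mathbf{x}$. ¤ $\mathbf{y}$ is independent of $\mathbf{z}$. VAE (latent variable $\mathbf{z}$ only): $p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, d\mathbf{z}$. Conditional VAE (additional information $\mathbf{y}$ as input): $p(\mathbf{x} \mid \mathbf{y}) = \int p(\mathbf{x} \mid \mathbf{z}, \mathbf{y})\, p(\mathbf{z})\, d\mathbf{z}$ 25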

26.

Conditional VAEs ¤ Generating image 𝐱 from attribute 𝐲 [Larsen+ 15] [Figure 5 of "Autoencoding beyond pixels using a learned similarity metric": using the VAE/GAN model to reconstruct dataset samples with visual attribute vectors added to their latent representations.] ¤ Generating image 𝐱 from text 𝐲 [Mansimov+ 15] [Figure: images generated from the captions "A stop sign is flying in blue skies.", "A herd of elephants flying in the blue skies.", "A toilet seat sits open in the grass field.", and "A person skiing on sand clad vast desert."] 26

27.

Multimodal VAEs ¤ Model the joint distribution of different modalities, $p(x, y)$. ¤ With appropriate training, they have the potential to generate with arbitrary conditioning (bidirectional transformations, $p(x \mid y)$ and $p(y \mid x)$). ¤ The latent variables acquire a representation that integrates the two modalities (shared representation $z$): $p(x, y) = \int p(x \mid z)\, p(y \mid z)\, p(z)\, dz$ ¤ This concept can be realized with VAEs (multimodal VAEs). ¤ Increase the number of decoders, one for each modality. 27

28.

JMVAE ¤ JMVAE [Suzuki+ 17]: ¤ Enables learning of bi-directional transformations and shared representations ¤ Example: image ($x$) and attribute ($y$) [Figures: bi-directional transformations and the shared representation space. Panel labels: Base (random), Not Male, Bald, Smiling; Input, Generated attributes, Average face, Reconstruction; Not Male, Eyeglasses, Not Young, Smiling, Mouth slightly open. Example generated attribute values: Male 0.95, Eyeglasses -0.99, Young 0.30, Smiling -0.97; Male 0.22, Eyeglasses -0.99, Young 0.87, Smiling -1.00.] 28

29.

Development of multimodal VAEs ¤ A survey of multimodal deep generative models [Suzuki+ 22] 29

30.

Deep generative models for world models ¤ World Model [Ha+ 18] ¤ A world model composed of a VAE and MDN-RNN [Graves + 13, Ha+ 17] ¤ The VAE (V module) learns spatially compressed representations of the environment, and the MDN-RNN (M module) learns temporal transitions. ¤ Trained in a game environment. 30

31.

Reinforcement Learning in the World Model ¤ Reinforcement learning is performed within the learned world model. ¤ The process mirrors human cognitive functions like mental rehearsal or sleep learning. ¤ Unlike the real world, learning can be repeated many times within the world model. ¤ On testing the agent in the real world (an actual game environment), it can be verified that the agent can perform the desired behavior. 31

32.

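Generative world models
[Graphical model: actions $\mathbf{a}_{t-2}, \mathbf{a}_{t-1}, \mathbf{a}_t$; latent states $\mathbf{z}_{t-1}, \mathbf{z}_t, \mathbf{z}_{t+1}$; observations $\mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}$]
¤ Objective (variational free energy):
$\mathcal{F}(\mathbf{x}_{1:T}, \mathbf{a}_{1:T-1}) = -\sum_{t=1}^{T} \left( \mathbb{E}_{q_\phi(\mathbf{z}_t \mid \mathbf{x}_{1:t}, \mathbf{a}_{1:t-1})}\left[\log p_\theta(\mathbf{x}_t \mid \mathbf{z}_t)\right] - \mathbb{E}_{q_\phi(\mathbf{z}_{t-1} \mid \mathbf{x}_{1:t-1}, \mathbf{a}_{1:t-2})}\left[ D_{KL}\left[ q_\phi(\mathbf{z}_t \mid \mathbf{x}_{1:t}, \mathbf{a}_{1:t-1}) \,\|\, p_\theta(\mathbf{z}_t \mid \mathbf{z}_{t-1}, \mathbf{a}_{t-1}) \right] \right] \right)$
First term: prediction error for observation. Second term: prediction error for state.
¤ Introducing memory into inference models and generative models using RNNs (RSSM) [Hafner+ 19].
32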

33.

State representation learning in latent space ¤ Learning the invariant structure of the world from the agent's perspective and actions [Gregor+ 19]. ¤ Visualizing the representations learned in the state space. ¤ The agent walking around creates a map of the environment in the state space. ¤ The agent ensures that the representation in the learned space is consistent, leading to more predictable and efficient navigation. https://www.youtube.com/watch?v=dOnvAp_wxv0 33

34.

World models and reinforcement learning ¤ In reinforcement learning, using a world model is expected to provide higher sample efficiency and task transferability. ¤ Challenges: ¤ How should an agent collect data from the external world? ¤ How should the model be learned from data? ¤ How should the policy be learned on the world model? [Kaiser+ 20] 34

35.

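World models with a reward prediction model ¤ Dreamer [Hafner+ 20]: ¤ Incorporates a prediction model of reward $r$: $p(r_t \mid \mathbf{z}_t)$. ¤ Predicts future rewards based on long-term predictions in the latent space as a value function. ¤ Learns policies using the value function as the objective. [Graphical model: rewards $r_{t-1}, r_t, r_{t+1}$; actions $\mathbf{a}_{t-2}, \mathbf{a}_{t-1}, \mathbf{a}_t$; latent states $\mathbf{z}_{t-1}, \mathbf{z}_t, \mathbf{z}_{t+1}$; observations $\mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}$] 35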

36.

Improvements to Dreamer ¤ Dreamer v2 [Hafner+ 21] (illustrated on the lower left): ¤ Discretized state representation and modified the regularization learning (by increasing the learning rate of prior). ¤ Achieved significantly better results than model-free methods on Atari games. ¤ Dreamer v3 [Hafner+ 23] (lower right): ¤ Expanded the model size and applied normalization techniques (using the symlog function). ¤ Accomplished complex tasks such as diamond collection in Minecraft without human demonstrations. 36
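The symlog function mentioned for Dreamer v3 compresses large magnitudes while keeping the sign, and symexp inverts it. A minimal sketch using the usual definitions $\mathrm{symlog}(x) = \mathrm{sign}(x)\ln(|x| + 1)$ and $\mathrm{symexp}(y) = \mathrm{sign}(y)(e^{|y|} - 1)$ follows; how these are wired into the model's losses is not shown and is not claimed here.

```python
import math


def symlog(x):
    """Symmetric log: sign(x) * ln(|x| + 1); roughly linear near zero."""
    return math.copysign(math.log1p(abs(x)), x)


def symexp(y):
    """Inverse of symlog: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


# Targets spanning several orders of magnitude end up on a comparable scale.
raw_targets = [-1000.0, -1.0, 0.0, 0.5, 250.0]
squashed = [symlog(value) for value in raw_targets]
recovered = [symexp(value) for value in squashed]
```

Predicting in the squashed space and decoding with `symexp` keeps targets of very different scales within a similar numeric range.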

37.

Masked World Model ¤ Masked Autoencoder [He+ 21]: ¤ As a pre-training step for Vision Transformer [Dosovitskiy+ 20], it is trained to reconstruct input images by randomly masking patches of the image. ¤ Masked World Model [Seo+ 22]: ¤ By using Masked Autoencoder for representation learning in world models, the performance of modeling interactions with small objects improves, leading to high performance in manipulation experiments. 37
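The random patch masking used in masked autoencoding can be sketched as follows. The patch list, the 14 x 14 grid size, and the 0.75 masking ratio are illustrative assumptions; the encoder and the reconstruction loss are omitted.

```python
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return one boolean per patch; True means the patch is hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    masked_indices = set(rng.sample(range(num_patches), num_masked))
    return [index in masked_indices for index in range(num_patches)]


def visible_patches(patches, mask):
    """Keep (index, patch) pairs for the patches the encoder is allowed to see."""
    return [
        (index, patch)
        for index, (patch, hidden) in enumerate(zip(patches, mask))
        if not hidden
    ]


rng = random.Random(0)
patches = [f"patch_{i}" for i in range(196)]  # stand-ins for a 14 x 14 grid of patches
mask = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = visible_patches(patches, mask)
# Training would ask the decoder to reconstruct the hidden patches from `visible`.
```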

38.

LLMs for world models ¤ Transformers are Sample-Efficient World Models [Micheli+ 23]: ¤ Uses a sequence modeling approach with Transformers to predict transitions in the latent space. ¤ Achieved human-level performance on the Atari benchmark with the equivalent of approximately 2 hours of gameplay. 38

39.

DayDreamer ¤ DayDreamer [Wu+ 22]: Application of Dreamer to Real-World Robotics ¤ Learning the world model from data collected by the robot interacting with the environment. ¤ The robot learns policies solely based on the learned world model. ¤ The use of a world model allows for efficient learning and adaptability to new tasks and perturbations. 39

40.

Object-centric world models ¤ Object-centric world models (object-centric representation learning): ¤ A framework for recognizing (inferring) and generating representations for each object in an image or video without explicit supervision. ¤ Prepare latent variables for each object. [Greff+ 20] [Lin+ 20] [Veerapaneni + 19] 40

41.

Separation of object-centric representations ¤ We propose a model that separates the representations of objects that are related to interaction (dynamic representations, such as positions) and those that are not related (global representations, such as colors) [Nakano+ 23]. ¤ We successfully separated the representations so that we can change only the color of the object without changing its position. Dynamic representation Global representation 41

42.

Separation of object-centric representations ¤ Separating the representations improved the long-term prediction and planning performance of the world model [Nakano+ 23]. ¤ Long-term prediction: ¤ Existing methods (OP3) fail in prediction in the middle (highlighted in red). ¤ Planning: ¤ We plan operations in the world model to approach the goal image and execute them in the actual environment. ¤ When the number of objects increases, the proposed method outperforms existing methods. 42

43.
Control as inference
¤ Dreamer: policies are learned using the value function, based on the reward prediction model, as the objective.
¤ Can we learn policies with variational free energy as the objective (as variational inference)?

¤ Control as inference:
¤ Introduce an optimality variable $\mathcal{O}_t \in \{0, 1\}$ to evaluate whether state $\mathbf{z}_t$ and action $\mathbf{a}_t$ are optimal.
¤ The distribution that the optimality variable follows is given by $p(\mathcal{O}_t = 1 \mid \mathbf{z}_t, \mathbf{a}_t) = \exp(r(\mathbf{z}_t, \mathbf{a}_t))$, where $r$ is the reward function.
¤ Deriving the optimal policy $p_{opt}(\mathbf{a}_t \mid \mathbf{z}_t)$ is synonymous with inference from $\mathcal{O}_{t:T} = 1$ and $\mathbf{z}_t$:

$p_{opt}(\mathbf{a}_t \mid \mathbf{z}_t) = p(\mathbf{a}_t \mid \mathbf{z}_t, \mathcal{O}_{t:T} = 1)$

43
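For a small discrete problem, the posterior $p(\mathbf{a}_t \mid \mathbf{z}_t, \mathcal{O}_{t:T} = 1)$ can be computed exactly with backward messages. The sketch below assumes a known tabular transition model, a uniform action prior, and non-positive rewards so that $\exp(r)$ is a valid probability; the two-state example and all names are hypothetical.

```python
import math


def optimal_policies(rewards, transitions, horizon):
    """Return policies[t][z][a] = p(a_t | z_t, O_{t:T} = 1) under a uniform action prior.

    rewards[z][a] must be <= 0 so that exp(reward) is a probability.
    transitions[z][a] is a probability list over next states.
    """
    num_states = len(rewards)
    num_actions = len(rewards[0])
    action_prior = 1.0 / num_actions
    state_message = [1.0] * num_states  # nothing left to satisfy after the last step
    policies = []
    for _ in range(horizon):  # sweep backward in time from the last step
        # pair_message[z][a] = p(O_{t:T} = 1 | z_t = z, a_t = a)
        pair_message = [
            [
                math.exp(rewards[z][a])
                * sum(p * state_message[z_next] for z_next, p in enumerate(transitions[z][a]))
                for a in range(num_actions)
            ]
            for z in range(num_states)
        ]
        # state_message[z] = p(O_{t:T} = 1 | z_t = z)
        state_message = [action_prior * sum(row) for row in pair_message]
        policies.append(
            [[action_prior * m / s for m in row] for row, s in zip(pair_message, state_message)]
        )
    policies.reverse()  # index 0 now corresponds to the first time step
    return policies


rewards = [[-1.0, 0.0], [-2.0, -0.5]]
transitions = [
    [[1.0, 0.0], [0.0, 1.0]],  # from state 0: action 0 stays, action 1 moves to state 1
    [[1.0, 0.0], [0.0, 1.0]],  # from state 1: action 0 moves to state 0, action 1 stays
]
policies = optimal_policies(rewards, transitions, horizon=3)
```

At the final step the policy reduces to a softmax over rewards; at earlier steps it also weighs how likely future optimality is from the next state.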

44.
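Control as inference
¤ Objective (variational free energy):

$\mathcal{F}(\mathbf{x}_{1:T}, \mathcal{O}_{1:T}) = -\mathbb{E}_{q_\phi(\mathbf{z}_{1:T}, \mathbf{a}_{1:T})}\left[ \sum_{t=1}^{T} \left( \log p(\mathcal{O}_t \mid \mathbf{z}_t, \mathbf{a}_t) - \log q_\phi(\mathbf{z}_t \mid \mathbf{a}_t) \right) \right] - \sum_{t=1}^{T} \mathbb{E}_{q_\phi(\mathbf{z}_t \mid \mathbf{x}_{1:t}, \mathbf{a}_{1:t-1})}\left[ \log p_\theta(\mathbf{x}_t \mid \mathbf{z}_t) \right]$

First term: objective of reinforcement learning, where $\log p(\mathcal{O}_t = 1 \mid \mathbf{z}_t, \mathbf{a}_t) = r(\mathbf{z}_t, \mathbf{a}_t)$.
Second term: prediction error for observation (objective of the world model).

[Graphical model: optimality variables $\mathcal{O}_{t-1}, \mathcal{O}_t, \mathcal{O}_{t+1}$; actions $\mathbf{a}_{t-2}, \mathbf{a}_{t-1}, \mathbf{a}_t$; latent states $\mathbf{z}_{t-1}, \mathbf{z}_t, \mathbf{z}_{t+1}$; observations $\mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}$]

¤ Within the CAI framework, it is possible to separate the learning of the world model (perception) and the optimization of the policy (control).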
44

45.

The free energy principle ¤ In theoretical neuroscience, the free energy principle considers a unified framework to explain perception, action, and learning [Friston 10]. ¤ Perception, learning, and action are all considered as variational inference within the generative model. ¤ The brain aims to minimize “surprise”. ¤ The objective is the same (variational free energy) as in generative world models. ¤ Predictive coding [Rao+ 99] ¤ In the case of deep generative models, point estimates are used rather than computing posterior of parameters. 45

46.

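Active inference and expected free energy
¤ Action selection in the free energy principle:
¤ Agents make action choices to achieve preferable recognition over the future using their internal model. => active inference
¤ Given a sequence of actions (policy) $\boldsymbol{\pi} = [\mathbf{a}_t, \ldots, \mathbf{a}_{T-1}]$, it takes the expected value of free energy based on future observations and state predictions (expected free energy).
¤ We choose the policy $\boldsymbol{\pi}$ that minimizes this objective.
¤ Objective (expected free energy):
$G_t(\boldsymbol{\pi}) = -\mathbb{E}_{q(\mathbf{x}_t, \mathbf{z}_t \mid \boldsymbol{\pi})}\left[ \log p(\mathbf{x}_t \mid \boldsymbol{\pi}) \right] - \mathbb{E}_{q(\mathbf{x}_t \mid \boldsymbol{\pi})}\left[ D_{KL}\left[ q(\mathbf{z}_t \mid \mathbf{x}_t, \boldsymbol{\pi}) \,\|\, q(\mathbf{z}_t \mid \boldsymbol{\pi}) \right] \right]$
[Graphical model: policy $\boldsymbol{\pi}$; actions $\mathbf{a}_{t-2}, \mathbf{a}_{t-1}, \mathbf{a}_t$; latent states $\mathbf{z}_{t-1}, \mathbf{z}_t, \mathbf{z}_{t+1}$; predicted observations $\hat{\mathbf{x}}_{t-1}, \hat{\mathbf{x}}_t, \hat{\mathbf{x}}_{t+1}$]
46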

47.

The meaning of expected free energy
$G_t(\boldsymbol{\pi}) = -\mathbb{E}_{q(\mathbf{x}_t, \mathbf{z}_t \mid \boldsymbol{\pi})}\left[ \log p(\mathbf{x}_t \mid \boldsymbol{\pi}) \right] - \mathbb{E}_{q(\mathbf{x}_t \mid \boldsymbol{\pi})}\left[ D_{KL}\left[ q(\mathbf{z}_t \mid \mathbf{x}_t, \boldsymbol{\pi}) \,\|\, q(\mathbf{z}_t \mid \boldsymbol{\pi}) \right] \right]$ (first term: extrinsic value; second term: intrinsic value)
¤ First term: Represents how much the expected observation aligns with the prior belief (likelihood) about the observation.
¤ Assigns a high value for observations that align with the belief (observations with high likelihood).
¤ Represents the extrinsic value and is a goal-oriented term (equivalent to the value of exploitation).
¤ Second term: Represents the expected value of information gain.
¤ Assigns a higher value when the amount of newly acquired information (Bayesian surprise) is large.
¤ Represents the intrinsic value and promotes actions that yield novel observations (equivalent to the value of exploration).
47
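The two terms can be evaluated exactly for a one-step discrete model. In this sketch, `state_belief` stands for $q(\mathbf{z}_t \mid \boldsymbol{\pi})$, `likelihood[z][x]` for the observation model, and `preference` for the prior belief over observations that the first term scores against; treating that prior as a fixed preference distribution is an assumption of this example, as are all names and numbers.

```python
import math


def expected_free_energy(state_belief, likelihood, preference):
    """Return (G, extrinsic value, information gain) for one future time step."""
    num_states = len(state_belief)
    num_obs = len(preference)
    # Predicted observation distribution q(x | pi) = sum_z q(z | pi) p(x | z).
    obs_belief = [
        sum(state_belief[z] * likelihood[z][x] for z in range(num_states))
        for x in range(num_obs)
    ]
    # Extrinsic value: expected log prior belief (preference) over observations.
    extrinsic = sum(q * math.log(p) for q, p in zip(obs_belief, preference) if q > 0.0)
    # Intrinsic value: expected KL between posterior q(z | x, pi) and prior q(z | pi).
    info_gain = 0.0
    for x in range(num_obs):
        for z in range(num_states):
            joint = state_belief[z] * likelihood[z][x]
            if joint > 0.0:
                posterior = joint / obs_belief[x]
                info_gain += joint * math.log(posterior / state_belief[z])
    return -extrinsic - info_gain, extrinsic, info_gain


belief = [0.5, 0.5]
preference = [0.5, 0.5]
informative = [[1.0, 0.0], [0.0, 1.0]]    # observations reveal the state
uninformative = [[0.5, 0.5], [0.5, 0.5]]  # observations carry no state information

g_informative, _, gain_informative = expected_free_energy(belief, informative, preference)
g_uninformative, _, gain_uninformative = expected_free_energy(belief, uninformative, preference)
```

With equal extrinsic value, the policy leading to informative observations has the lower expected free energy, which is the exploration effect described by the second term.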

48.

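Control as inference vs. active inference [Graphical models: CAI with optimality variables $\mathcal{O}_{t-1}, \mathcal{O}_t, \mathcal{O}_{t+1}$ and observations $\mathbf{x}_{t-1}, \mathbf{x}_t, \mathbf{x}_{t+1}$; active inference with policy $\boldsymbol{\pi}$ and predicted observations $\hat{\mathbf{x}}_{t-1}, \hat{\mathbf{x}}_t, \hat{\mathbf{x}}_{t+1}$; both with actions $\mathbf{a}_{t-2}, \mathbf{a}_{t-1}, \mathbf{a}_t$ and latent states $\mathbf{z}_{t-1}, \mathbf{z}_t, \mathbf{z}_{t+1}$] ¤ CAI: Considers the goal as an additional extrinsic element in unbiased perception. ¤ Allows for the separation of the objectives of perception and control, making it easier to handle in model-based reinforcement learning. ¤ Active inference: Considers that biased perception is vital for adaptive action selection. ¤ Views the loop of action and perception as an indistinguishable continuous flow. ¤ It is natural for organisms and neuro-robotics. 48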

49.

Challenges of world models and active inference in DL ¤ Compared to executing latent state inference or predictions, learning the model takes time. ¤ After a robot agent collects data from the external environment, it stores them as a dataset (replay buffer) and learns from it. ¤ Cannot sequentially update its own model. 49

50.

Challenges of world models and active inference in DL ¤ In conventional world model research, only a single modality (mostly images) has been handled. ¤ Humans acquire abstract representations based on various modality information. [Shi+ 19] => Importance of multimodal learning in world models and active inference. 50

51.

Challenges of world models and active inference in DL ¤ Need for hierarchical abstraction integrating observation and action: ¤ Current world models have learned abstract representations of observations but not of actions. ¤ Primitive actions (like lifting, pushing, etc.) should be autonomously acquired through interactions with the environment. ¤ When creating autonomous agents with LLMs, primitive actions are provided as an API. ¤ In the field of robotics, a similar approach was previously introduced and explored [Yamashita+ 12]. ¤ Integrate perceptual and behavioral representations generated from the environment and link them to knowledge such as that held by LLMs. 51

52.

Conclusion ¤ Explained the relationship between the world models and the free energy principle. ¤ Explained several remaining challenges: ¤ Time required to learn the model itself compared to inference ¤ Need for multimodal observation ¤ Need for hierarchical abstraction integrating observations and actions 52