[Diffusion Model Study Group] TEncDM: Understanding the Properties of Diffusion Model in the Space of Language Model Encodings

June 25, 2024

#NaturalLanguageProcessing #DiffusionModels #TextGeneration #Transformer #BERT

Slide overview

YouTube video: https://youtu.be/GoPeArFbCIg

Deep Learning JP

@DeepLearning2023

Text of each page

DEEP LEARNING JP [Diffusion Papers]
TEncDM: Understanding the Properties of Diffusion Model in the Space of Language Model Encodings
高城頌太 (D1, Matsuo Lab, Graduate School of Engineering, The University of Tokyo)
http://deeplearning.jp/

Bibliographic information
• Title: TEncDM: Understanding the Properties of Diffusion Model in the Space of Language Model Encodings
  https://arxiv.org/pdf/2402.19097
  ACL 2024
• Authors: Alexander Shabalin, Viacheslav Meshchaninov, Tingir Badmaev, Dmitry Molchanov, Grigory Bartosh, Sergey Markov, Dmitry Vetrov
• Summary: An investigation of the properties of text diffusion models in the language-model embedding space

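Approaches to text generation
• Autoregressive model (AR model)
  – Generates words one at a time, from the beginning of the sequence
• Non-autoregressive model (NAR model)
  – Outputs all words at the same time
Several methods have been proposed that apply diffusion models, in particular to NAR models.
(Note: there are also ways to apply diffusion models to AR models.)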

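History of diffusion models for text generation
• Diffusion models have been very successful for image, video, and audio, but do not yet work as well for text
• The number of papers on diffusion-based text generation is nevertheless growing every year
Source: Diffusion models in text generation: a survey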

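Applying diffusion models to discrete text
• Discrete Text Diffusion Model
  – Operates on discrete objects such as tokens
  – Trained to add noise to the tokens themselves and then remove it
• Continuous Text Diffusion Model
  – Operates on continuous-valued objects such as embeddings
  – Trained to add noise to the embeddings and then remove it
This talk focuses on the continuous approach.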

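Related work: Diffusion-LM
• Applies a diffusion model to word embeddings
• Jointly learns an Embedding Step, which maps discrete text to embeddings, and a Rounding Step, which maps embeddings back to text
[Figure: Embedding Step, Rounding Step, and loss function]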
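The two steps can be illustrated with a small sketch. This is a simplification under stated assumptions: Diffusion-LM learns its rounding step jointly with the embeddings, whereas the code below swaps in plain nearest-neighbour rounding against a fixed, randomly initialised table. The names `embed`, `round_to_tokens`, and `emb_table` are hypothetical.

```python
import numpy as np

def embed(token_ids, emb_table):
    """Embedding step: look up one vector per token id."""
    return emb_table[np.asarray(token_ids)]

def round_to_tokens(vectors, emb_table):
    """Rounding step (simplified): pick the token whose embedding is closest."""
    # Squared Euclidean distances, shape (seq_len, vocab_size).
    d2 = ((vectors[:, None, :] - emb_table[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(0)
emb_table = rng.standard_normal((10, 4))  # hypothetical vocab of 10, dimension 4
tokens = np.array([3, 1, 7])

x0 = embed(tokens, emb_table)
recovered = round_to_tokens(x0, emb_table)
```

With clean embeddings the round trip returns the original ids; a diffusion model's outputs are only approximately on the embedding table, which is why the rounding step matters.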

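Related work: LD4LG
• Uses a pretrained model to embed text into the latent space
• Denoises latent representations obtained with pretrained BART or T5
• Uses an autoregressive model as the decoder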

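Positioning of this work
• Goals
  – Investigate best practices for text diffusion models
  – Avoid the limitations imposed by AR models
• Contributions
  – Proposes TEncDM, which trains a text diffusion model in the latent space of a pretrained Transformer
  – Investigates the importance of the decoder (proposes a decoder training method and architecture)
  – Shows that self-conditioning can reduce the number of denoising steps
  – Investigates the effect of noise scheduling (cosine and sqrt noise schedules are insufficient)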

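Problem setting
• Text generation problem
  – $p(y)$, where $y = [y_1, \dots, y_n]$
  – $p(y \mid x)$, where $x = [x_1, x_2, \dots, x_m]$, $y = [y_1, \dots, y_m]$
• Gaussian diffusion model
  – Forward process: $q(z_t \mid z_0) = \mathcal{N}\left(\sqrt{\alpha_t}\, z_0,\ (1 - \alpha_t)\mathbf{I}\right)$, with $\alpha_t \in [0, 1]$, $t \in [0, 1]$
  – Loss function: [equation not preserved in the extracted slide text]
  – Time discretization: $(z_{t_T}, \dots, z_{t_1})$, with $1 = t_T > t_{T-1} > \dots > t_1 = 0$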
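The forward process above can be sampled in closed form. The following is a minimal NumPy sketch; the function names, the tensor shape, and the uniform spacing of the time grid are assumptions made for illustration, not details taken from the slides.

```python
import numpy as np

def q_sample(z0, alpha_t, rng):
    """Draw z_t ~ N(sqrt(alpha_t) * z0, (1 - alpha_t) * I)."""
    eps = rng.standard_normal(z0.shape)
    return np.sqrt(alpha_t) * z0 + np.sqrt(1.0 - alpha_t) * eps

def time_grid(num_points):
    """Return t_T = 1 > ... > t_1 = 0 (uniform spacing is an assumption)."""
    return np.linspace(1.0, 0.0, num_points)

rng = np.random.default_rng(0)
z0 = rng.standard_normal((16, 768))  # hypothetical (sequence length, hidden size)

z_half = q_sample(z0, alpha_t=0.5, rng=rng)  # half signal, half noise
ts = time_grid(5)                            # times visited during generation
```

With $\alpha_t = 1$ the sample equals $z_0$, and with $\alpha_t = 0$ it is pure standard Gaussian noise, which is what the variance-preserving form guarantees.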

TEncDM: Text Encoding Diffusion Model
• Diffusion encoder, $E_{diff}$
  – Pretrained Transformer-based language model (such as BERT)
• Decoder, $D$
  – NAR decoder (trained)
  – $Cor(z_0) = z_t$, $t \sim U[0, 0.15]$
• Diffusion model, $\hat{z}_\theta$
  – 12 BERT layers
  – Variance-preserving scheme
• Self-conditioning ($p = 0.5$)
[Figure: branches taken with probability $p$ and $1 - p$. Top: training phase; bottom: inference phase]

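Supplement: what is self-conditioning?
• A technique commonly used in discrete diffusion models
• Uses the model's own previously generated samples as a conditioning variable
• Empirically shown to improve the sample quality of diffusion models
• Without self-conditioning: $\hat{z}_0^{t} = \hat{z}_\theta(z_t, t)$; with self-conditioning: $\hat{z}_0^{t} = \hat{z}_\theta(z_t, t, \hat{z}_0^{t-1})$
Source: Analog Bits: Generating Discrete Data using Diffusion Models with Self-Conditioning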

https://arxiv.org/abs/2208.04202
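Self-conditioning can be sketched as follows, assuming the usual recipe from the Analog Bits paper: during training the extra input is a first-pass estimate with probability p and zeros otherwise, and during inference it is the estimate from the previous step. `toy_denoiser` is a hypothetical stand-in for the network, and the "predict, then re-noise" update with a cosine schedule is a simplification chosen for brevity, not the sampler used in the paper.

```python
import numpy as np

def toy_denoiser(z_t, t, z0_prev):
    """Hypothetical stand-in for the network z_theta(z_t, t, z0_prev)."""
    return 0.5 * z_t + 0.5 * z0_prev

def self_cond_input(denoiser, z_t, t, p, rng):
    """Training: with probability p condition on a first-pass estimate, else on zeros."""
    zeros = np.zeros_like(z_t)
    if rng.random() < p:
        # In a real framework this first pass would be detached from the gradient.
        return denoiser(z_t, t, zeros)
    return zeros

def sample(denoiser, z_T, times, rng):
    """Inference: always feed the previous step's estimate back into the model."""
    z_t, z0_hat = z_T, np.zeros_like(z_T)
    for t, t_next in zip(times[:-1], times[1:]):
        z0_hat = denoiser(z_t, t, z0_hat)
        alpha_next = np.cos(0.5 * np.pi * t_next) ** 2  # cosine schedule (assumption)
        eps = rng.standard_normal(z_t.shape)
        z_t = np.sqrt(alpha_next) * z0_hat + np.sqrt(1.0 - alpha_next) * eps
    return z0_hat

rng = np.random.default_rng(0)
z_T = rng.standard_normal((4, 8))
out = sample(toy_denoiser, z_T, np.linspace(1.0, 0.0, 11), rng)
```

Note that in this sketch training conditions on an estimate computed from the same $z_t$, while inference conditions on the estimate carried over from the previous step.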

Experimental setup
• Dataset: ROCStories
• Metrics:
  – Perplexity (ppl)
  – Diversity (div): [definition not preserved in the extracted slide text]
  – Memorization (mem): proportion of generated 4-grams that also appear in the training set
  – MAUVE score (see the earlier DL reading-group slides)
• Model setup
  – Encoder
    • BERT (bert-base-cased), T5 (t5-base)
  – Decoder
    • MLP: two linear layers
    • Transformer: 3-layer transformer
  – Noise scheduler
    • Tan-d noise scheduler (d = 9 by default)

https://www.slideshare.net/slideshow/mauve-measuring-the-gap-between-neural-text-and-human-text-using-divergence-frontiers/251865241
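The memorization metric described above is simple enough to sketch directly. The version below assumes whitespace tokenization and pools all generated 4-grams before taking the ratio; the tokenizer and aggregation used in the actual experiments may differ, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def memorization(generated, train, n=4):
    """Fraction of generated n-grams that also occur somewhere in the training texts."""
    train_ngrams = set()
    for text in train:
        train_ngrams.update(ngrams(text.split(), n))

    total = hits = 0
    for text in generated:
        for gram in ngrams(text.split(), n):
            total += 1
            hits += gram in train_ngrams
    return hits / total if total else 0.0

train_texts = ["the cat sat on the mat"]
generated_texts = ["the cat sat on a rug"]
mem = memorization(generated_texts, train_texts)
```

Here one of the three generated 4-grams ("the cat sat on") is copied from the training text, so `mem` is 1/3. Lower values indicate less verbatim copying.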

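Effect of the encoder and decoder on generation results
• Encoder
  – "BERT emb" uses the token embedding space; "BERT" and "T5" use the final-layer outputs
  – BERT encodings give the best performance
• Decoder
  – $Cor(z_0)$ means that $z_0$ is converted to $z_t$ with $t \sim U[0, 0.15]$ before decoding
  – The MLP decoder overfits to special tokens and training does not progress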
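The corruption applied to the decoder's inputs can be sketched as below: each example gets its own small noise level before being passed to the decoder. The cosine schedule, the tensor shape, and the name `corrupt_latents` are assumptions made for illustration (the slides state only the range of t, and the default scheduler in the experiments is tan-d).

```python
import numpy as np

def corrupt_latents(z0, rng, t_max=0.15):
    """Cor(z0): draw t ~ U[0, t_max] per example, then apply the Gaussian forward step."""
    batch = z0.shape[0]
    t = rng.uniform(0.0, t_max, size=batch)
    alpha = np.cos(0.5 * np.pi * t) ** 2                  # cosine schedule (assumption)
    alpha = alpha.reshape(batch, *([1] * (z0.ndim - 1)))  # broadcast over remaining axes
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha) * z0 + np.sqrt(1.0 - alpha) * eps
    return z_t, t

rng = np.random.default_rng(0)
z0 = rng.standard_normal((2, 16, 768))  # hypothetical (batch, sequence length, hidden size)
z_t, t = corrupt_latents(z0, rng)
# The decoder would then be trained to reconstruct the tokens from z_t instead of z0.
```

Training on lightly corrupted latents is meant to make the decoder tolerant of the small errors left in the diffusion model's output.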

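Effect of self-conditioning
• With self-conditioning, MAUVE decreases as the number of steps increases
  – At 50 steps, both PPL and MAUVE are better than without self-conditioning
  – The cause is a mismatch in z_0 between training and inference
• Computing the magnitude (formula not preserved in the extracted slide text) shows that the gap from training becomes more pronounced as the number of steps increases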

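Effect of the noise scheduler on generation results
• Losses
  – Reconstruction loss: loss in the latent space
  – Text accuracy loss: token-level accuracy
• Evaluation metrics for different noise schedulers
  – The sqrt noise scheduler adds a large amount of noise early on, so it performs better than the cosine noise scheduler
  – The tan-d noise scheduler adds noise consistently, so it achieves higher performance (tan-9 gives the best MAUVE)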
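To make the comparison concrete, the three schedules can be written as functions of t in [0, 1]. The slides do not give the formulas, so everything below is an assumption: the cosine form omits the small offset used in some implementations, the sqrt form follows Diffusion-LM with s = 1e-4, and the tan-d form is taken to be α_t = 1 / (1 + d² tan²(πt/2)), which reduces to the cosine schedule at d = 1. Check the paper for the exact definitions.

```python
import numpy as np

def cosine_alpha(t):
    """Cosine schedule: alpha_t = cos^2(pi * t / 2)."""
    return np.cos(0.5 * np.pi * t) ** 2

def sqrt_alpha(t, s=1e-4):
    """Sqrt schedule: alpha_t = 1 - sqrt(t + s), clipped to [0, 1]."""
    return np.clip(1.0 - np.sqrt(t + s), 0.0, 1.0)

def tan_d_alpha(t, d=9.0):
    """Tan-d schedule (assumed form): alpha_t = 1 / (1 + d^2 * tan^2(pi * t / 2))."""
    return 1.0 / (1.0 + (d * np.tan(0.5 * np.pi * t)) ** 2)

ts = np.linspace(0.0, 1.0, 11)
for name, fn in [("cosine", cosine_alpha), ("sqrt", sqrt_alpha), ("tan-9", tan_d_alpha)]:
    print(name, np.round(fn(ts), 3))
```

Printing the values shows how much signal remains at each time: under these assumed forms, both the sqrt and tan-9 schedules remove signal faster than cosine at small t.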

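Comparison with AR (⋆), non-diffusion NAR (◦), and diffusion NAR (†) models
• Results on the QQP dataset (paraphrase task) and the XSum dataset (summarization task)
• On both datasets, performance is higher than that of existing NAR models
• BERT encodings perform better than T5 encodings
• Outperforms the Transformer on some metrics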

Thank you.