CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback


September 27, 2025

Slide Overview

This is one of the papers that caught my attention at ACL 2025; I shared it at a paper-reading group.


Nice to meet you. I am interested in medical AI. I belong to the Matsuo Lab deep learning paper-reading group and AcademiX Medical. At my university lab and at several internships, I have conducted research on the analysis of medical data (text, images, and sensor data).


Text of each page
1.

Paper Reading 25.07.15 CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback Yuki Tashiro

2.

Background: About the Paper
• Title: CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback
• Authors: Dennis Hein, Zhihong Chen, Sophie Ostmeier, Justin Xu, Maya Varma, Eduardo Pontes Reis, Arne Edward Michalson, Christian Bluethgen, Hyun Joo Shin, Curtis Langlotz, Akshay S Chaudhari
• Conference: ACL 2025 Main Conference Paper

3.

Background: First Author

4.

Background: Last Author

5.

Background: Why I Chose This Paper
• My interests: medical applications, multimodal models, and the latest papers
• → Checked the ACL 2025 papers → CheXalign
• Other papers that I did not choose: I will introduce them after presenting this paper

6.

Introduction: Motivation
• Chest X-ray interpretation is one of the most common diagnostic procedures, exceeding 1.4 billion exams per year.
• Radiologists face staff shortages and growing workloads, risking delayed interpretations.
• Automated VLM assistants can help but require extremely high accuracy in this high-stakes domain.
• Existing vision-language models require costly human-in-the-loop feedback to align outputs with radiologist preferences.
• Supervised fine-tuning alone can lead to overfitting and degraded report quality, limiting clinical scalability.

7.

Introduction: Contributions
• Automated Preference Pair Pipeline: auto-generates preference pairs using reference-based metrics, eliminating manual feedback.
• Systematic Benchmarking: evaluates across metrics, algorithms, and models, demonstrating that general-domain NLG metrics work.
• New SOTA on MIMIC-CXR: achieves top CheXbert scores on RRG without sacrificing factual accuracy.
• Length-Controlled Metric: introduces LC-GREEN to prevent report-length exploitation.
• Robustness Across Tasks: confirms that fine-tuning gains generalize to diverse CXR perception and reasoning tasks.

8.

Related Works
• Vision-Language Models in Radiology
  • Recent works (e.g., BioViL, CheXagent, CheXagent-2) use image–text contrastive pretraining and supervised fine-tuning on datasets like MIMIC-CXR and CheXpert for report generation (RRG).
• Preference Fine-Tuning via Human Feedback (RLHF)
  • RLHF frameworks (Ziegler et al. 2020; Stiennon et al. 2020; Ouyang et al. 2022) use human-labeled preferences and reinforcement learning (PPO, REINFORCE) to align LLMs with user judgments.
• Direct Preference Optimization (DPO)
  • DPO (Rafailov et al. 2023) offers a closed-form alternative to RLHF; further variants include LC-DPO (length regularization), IPO (Azar et al.), KTO (Ethayarajh et al. 2024), and ORPO (Hong et al. 2024). (A minimal sketch of the DPO loss follows this slide.)
• LLM-as-a-Judge Approaches
  • General-domain studies (Dubois et al. 2023; Lee et al. 2024; Zheng et al. 2023) generate preference pairs and evaluate outputs using LLM-based metrics (BERTScore, GPT-based grading) without human annotators.
• Reference-Based Evaluation Metrics in Radiology
  • GREEN (Ostmeier et al. 2024): LLM-based factuality metric for CXR reports.
  • CheXbert (Smit et al. 2020): clinical label extraction + comparison metric.
  • Standard NLG metrics (BLEU, ROUGE, BERTScore) have also been applied for fact-based evaluation.
• Reward Overoptimization & Hallucination
  • Hong et al. 2024 demonstrate that SFT alone can inadvertently raise the likelihood of "bad" outputs; Gao et al. 2023 analyze reward-hacking phenomena; Park et al. 2024 report length-based gaming; Zhou et al. 2024 observe VLM hallucinations in radiology.
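Since DPO is the alignment algorithm used later in the talk, here is a minimal PyTorch sketch of the published DPO loss (Rafailov et al. 2023) for a batch of preference pairs. It is an illustration, not the authors' implementation; the argument names and the toy numbers in the usage example are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss. Each argument is the summed log-probability of a whole
    report (chosen = preferred, rejected = dispreferred) under the
    trainable policy or the frozen reference (SFT) model."""
    # Implicit reward of each report: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between preferred and rejected reports.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-42.0, -37.5]), torch.tensor([-55.0, -40.2]),
                torch.tensor([-44.0, -39.0]), torch.tensor([-50.0, -41.0]))
print(loss.item())
```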

9.

Methodology: 1. RRG Preference Fine-tuning without Human Feedback
• Overview: generate large-scale preference pairs (preferred vs. rejected) without new human labels (a hypothetical sketch of the pipeline follows this slide).
• Step 1: Leverage public CXR datasets (e.g., MIMIC-CXR) with radiologist-written reports.
• Step 2: Use reference-based metrics (e.g., GREEN) as automated "judges" to compare model outputs against the reference reports.
• Step 3: Apply canonical alignment algorithms (e.g., DPO) to these pairs to fine-tune the VLM policy.
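One plausible way to wire these steps together is to sample several candidate reports per study, score each against the radiologist-written reference with a metric such as GREEN, and keep the highest- and lowest-scoring candidates as the preferred/rejected pair. The sketch below assumes this best-vs-worst pairing; the function and helper names (build_preference_pairs, generate_report, green_score) are illustrative, and the paper's exact pairing strategy may differ.

```python
def build_preference_pairs(dataset, generate_report, green_score, n_samples=4):
    """For each (image, reference) study, sample candidate reports, score them
    against the reference with a reference-based judge, and keep the best and
    worst candidates as a (chosen, rejected) preference pair."""
    pairs = []
    for image, reference in dataset:
        candidates = [generate_report(image) for _ in range(n_samples)]
        scored = sorted(candidates, key=lambda c: green_score(c, reference))
        chosen, rejected = scored[-1], scored[0]  # highest vs. lowest score
        if chosen != rejected:  # skip degenerate pairs with identical text
            pairs.append({"image": image, "chosen": chosen, "rejected": rejected})
    return pairs
```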

10.

Methodology: 1. Baseline Models: CheXagent & CheXagent-2

Feature        | CheXagent              | CheXagent-2
Params         | 8B                     | 3B
Vision Encoder | EVA-CLIP-g             | Fine-tuned SigLIP
Language Model | Mistral-7B             | Phi-2
Memo           | SOTA model at the time | Smaller, yet more performant version

11.

Methodology: 2. Evaluation
• Radiology-Specific Metrics:
  • GREEN: LLM-based factuality metric for CXR reports (single-answer, reference-guided).
  • CheXbert Score: extracts 14 clinical labels (e.g., cardiomegaly, pneumonia) via the CheXbert labeler; measures clinical correctness.
• Reward Hacking Concern: models may "game" GREEN by generating excessively long reports.
• Mitigation: LC-GREEN (Length-Controlled GREEN):
  • LC-GREEN = GREEN / max(rel_verbosity, 1), where rel_verbosity = (candidate length) / (reference length).
  • Penalizes excessive verbosity, ensuring metric gains reflect true factual accuracy (see the sketch after this slide).
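A direct transcription of the LC-GREEN definition above into Python. The only assumption is that candidate and reference lengths are measured in the same unit (e.g., tokens or words).

```python
def lc_green(green_score, candidate_len, reference_len):
    """Length-controlled GREEN: LC-GREEN = GREEN / max(rel_verbosity, 1),
    with rel_verbosity = candidate length / reference length. Reports no
    longer than the reference keep their GREEN score; longer reports are
    divided by their relative verbosity."""
    rel_verbosity = candidate_len / reference_len
    return green_score / max(rel_verbosity, 1.0)

# A report twice as long as the reference has its score halved;
# a shorter-than-reference report is not rewarded for brevity.
print(lc_green(0.8, candidate_len=120, reference_len=60))  # -> 0.4
print(lc_green(0.8, candidate_len=50, reference_len=60))   # -> 0.8
```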

12.

Methodology: 2. Evaluation (GREEN)
• GREEN (Generative Radiology Report Evaluation and Error Notation): LLM-based factuality metric for CXR reports (single-answer, reference-guided).
• An LLM-based metric designed to evaluate the factual correctness of generated radiology reports against a reference report.
• It calculates a score from 0 to 1 by rewarding matched findings and heavily penalizing clinically significant errors.
• Provides not only a quantitative score but also a qualitative, human-readable summary explaining the specific errors found.

Sophie Ostmeier et al. 2024. GREEN: Generative Radiology Report Evaluation and Error Notation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 374–390, Miami, Florida, USA. Association for Computational Linguistics.
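As a rough illustration of how the description above ("reward matched findings, heavily penalize clinically significant errors, score between 0 and 1") could be aggregated from an LLM judge's counts, here is a toy scoring function. The exact formula is defined in Ostmeier et al. 2024 and may differ in detail; treat this as a sketch, not the metric itself.

```python
def green_style_score(matched_findings, significant_errors):
    """Toy aggregation in the spirit of GREEN: the fraction of matched
    findings among matched findings plus clinically significant errors,
    falling back to 0 when nothing was matched or counted."""
    total = matched_findings + significant_errors
    return matched_findings / total if total > 0 else 0.0

print(green_style_score(matched_findings=4, significant_errors=1))  # -> 0.8
```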

13.

Experiments & Results: Setup
• Base Models: CheXagent (8B) & CheXagent-2 (3B), SOTA open-source CXR VLMs with specialized vision encoders & LLMs.
• Datasets: MIMIC-CXR (148k training examples; an 80k subset for CheXagent), and the CheXpert Plus test set for generalization.

14.

Experiments & Results: Results
• Factual Accuracy ↑: the GREEN score improved by up to +57.4% vs. the SFT baseline.
• State-of-the-art CheXbert: new SOTA on MIMIC-CXR for clinical-label F1 accuracy.

15.

Experiments & Results: 1. Length Exploitation
• A positive correlation between average GREEN score and average report length for CheXagent.

16.

Experiments & Results: 1. Length Exploitation
• A manifestation of reward hacking via length exploitation.

17.

Experiments & Results: 2. Judge Optimization Results
• Boosts automated evaluation scores, increasing the GREEN score by up to 31.9%.

18.

Experiments & Results: 3. Generalization to CheXbert Scores
• The gains from judge optimization also generalize to CheXbert scores, reaching top CheXbert performance on MIMIC-CXR.

19.

Experiments & Results: 4. Alignment Tax Analysis
• The method substantially improves report quality without compromising performance on other image understanding tasks.

20.

Experiments & Results: Limitations
• Generality: needs testing on other VLM architectures & modalities.
• Hyperparameters: the search was not exhaustive; further tuning may yield better configurations.
• Scope of Methods: on-policy RLHF and human-in-the-loop approaches were not explored.
• Reference Quality: assumes high-quality reports; biases and gaps in the references may propagate.

21.

Conclusions
• CheXalign: an automated preference fine-tuning pipeline without human feedback, bridging AI alignment & medical imaging.
• Significant Gains: +57.4% GREEN, SOTA CheXbert, human-preferred outputs (62%).
• No Alignment Tax: maintained performance on auxiliary vision-language tasks.
• Toward Safer AI: a scalable path to improving the factuality & safety of medical VLMs.

22.

My View: Analysis
• ACL 2025 Papers
  • 1,700 papers were accepted as main conference papers.
  • 38 papers had titles that included at least one word related to the medical domain.
  • 188 papers were related to multi-modality (image, audio, speech).
  • 16 papers were related to both multi-modality and medical applications.
  • The radiology domain was the most frequently covered.

23.

My View: Analysis
• Benchmark
  • Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models
    • Constructs a comprehensive benchmark that evaluates models across two axes: 15 diverse medical specialties (e.g., cardiology, dermatology) and 8 core clinical capabilities (e.g., finding recognition, disease analysis, treatment planning). It uses human expert answers as a gold standard.
  • NGQA: A Nutritional Graph Question Answering Benchmark for Personalized Health-aware Nutritional Reasoning
• Radiology
  • CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback
  • Online Iterative Self-Alignment for Radiology Report Generation
  • Automated Structured Radiology Report Generation
  • The Impact of Auxiliary Patient Data on Automated Chest X-Ray Report Generation and How to Incorporate It