CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback


September 27, 2025

Slide Overview

This is one of the papers that caught my attention at ACL 2025; I shared it at a paper-reading group.


Nice to meet you. I am interested in medical AI. I belong to the Matsuo Lab deep learning paper-reading group and AcademiX Medical. At my university lab and at several internships, I have conducted research on the analysis of medical data (text, images, and sensor data).


Text of each page
1.

Paper Reading 25.07.15 CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback Yuki Tashiro

2.

Background: About the Paper
• Title: CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback
• Authors: Dennis Hein, Zhihong Chen, Sophie Ostmeier, Justin Xu, Maya Varma, Eduardo Pontes Reis, Arne Edward Michalson, Christian Bluethgen, Hyun Joo Shin, Curtis Langlotz, Akshay S Chaudhari
• Conference: ACL 2025 Main Conference Paper

3.

Background: First Author

4.

Background: Last Author

5.

Background: Why I Chose This Paper
• My interests: medical applications, multimodal models, and the latest papers
• → Checked the ACL 2025 papers → CheXalign
• Other papers that I did not choose: I will introduce them after presenting this paper

6.

Introduction: Motivation
• Chest X-ray interpretation is one of the most common diagnostic procedures, exceeding 1.4 billion exams per year.
• Radiologists face staff shortages and growing workloads, risking delayed interpretations.
• Automated VLM assistants can help but require extremely high accuracy in this high-stakes domain.
• Existing vision-language models require costly human-in-the-loop feedback to align outputs with radiologist preferences.
• Supervised fine-tuning alone can lead to overfitting and degraded report quality, limiting clinical scalability.

7.

Introduction: Contributions
• Automated Preference Pair Pipeline: auto-generates preference pairs using reference-based metrics, eliminating manual feedback.
• Systematic Benchmarking: evaluates across metrics, algorithms, and models, demonstrating that general-domain NLG metrics work.
• New SOTA on MIMIC-CXR: achieves top CheXbert scores on RRG without sacrificing factual accuracy.
• Length-Controlled Metric: introduces LC-GREEN to prevent report-length exploitation.
• Robustness Across Tasks: confirms that fine-tuning gains generalize to diverse CXR perception and reasoning tasks.

8.

Related Works
• Vision-Language Models in Radiology
  • Recent works (e.g., BioViL, CheXagent, CheXagent-2) use image–text contrastive pretraining and supervised fine-tuning on datasets like MIMIC-CXR and CheXpert for report generation (RRG).
• Preference Fine-Tuning via Human Feedback (RLHF)
  • RLHF frameworks (Ziegler et al. 2020; Stiennon et al. 2020; Ouyang et al. 2022) use human-labeled preferences and reinforcement learning (PPO, REINFORCE) to align LLMs with user judgments.
• Direct Preference Optimization (DPO)
  • DPO (Rafailov et al. 2023) offers a closed-form alternative to RLHF; further variants include LC-DPO (length regularization), IPO (Azar et al.), KTO (Ethayarajh et al. 2024), and ORPO (Hong et al. 2024). (A minimal sketch of the DPO loss follows this slide.)
• LLM-as-a-Judge Approaches
  • General-domain studies (Dubois et al. 2023; Lee et al. 2024; Zheng et al. 2023) generate preference pairs and evaluate outputs using LLM-based metrics (BERTScore, GPT-based grading) without human annotators.
• Reference-Based Evaluation Metrics in Radiology
  • GREEN (Ostmeier et al. 2024): LLM-based factuality metric for CXR reports.
  • CheXbert (Smit et al. 2020): clinical label extraction + comparison metric.
  • Standard NLG metrics (BLEU, ROUGE, BERTScore) have also been applied for fact-based evaluation.
• Reward Overoptimization & Hallucination
  • Hong et al. 2024 demonstrate that SFT alone can inadvertently raise the likelihood of "bad" outputs; Gao et al. 2023 analyze reward-hacking phenomena; Park et al. 2024 report length-based gaming; Zhou et al. 2024 observe VLM hallucinations in radiology.
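Since DPO is the alignment algorithm used later in the talk, here is a minimal PyTorch sketch of the published DPO loss (Rafailov et al. 2023) for a batch of preference pairs. It is an illustration, not the authors' implementation; the argument names and the toy numbers in the usage example are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss. Each argument is the summed log-probability of a whole
    report (chosen = preferred, rejected = dispreferred) under the
    trainable policy or the frozen reference (SFT) model."""
    # Implicit reward of each report: beta * log(pi_theta / pi_ref)
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    # Maximize the margin between preferred and rejected reports.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy usage with made-up log-probabilities for a batch of two pairs.
loss = dpo_loss(torch.tensor([-42.0, -37.5]), torch.tensor([-55.0, -40.2]),
                torch.tensor([-44.0, -39.0]), torch.tensor([-50.0, -41.0]))
print(loss.item())
```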

9.

Methodology: 1. RRG Preference Fine-tuning without Human Feedback
• Overview: generate large-scale preference pairs (preferred vs. rejected) without new human labels (a hypothetical sketch of the pipeline follows this slide).
• Step 1: Leverage public CXR datasets (e.g., MIMIC-CXR) with radiologist-written reports.
• Step 2: Use reference-based metrics (e.g., GREEN) as automated "judges" to compare model outputs against the reference reports.
• Step 3: Apply canonical alignment algorithms (e.g., DPO) to these pairs to fine-tune the VLM policy.
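One plausible way to wire these steps together is to sample several candidate reports per study, score each against the radiologist-written reference with a metric such as GREEN, and keep the highest- and lowest-scoring candidates as the preferred/rejected pair. The sketch below assumes this best-vs-worst pairing; the function and helper names (build_preference_pairs, generate_report, green_score) are illustrative, and the paper's exact pairing strategy may differ.

```python
def build_preference_pairs(dataset, generate_report, green_score, n_samples=4):
    """For each (image, reference) study, sample candidate reports, score them
    against the reference with a reference-based judge, and keep the best and
    worst candidates as a (chosen, rejected) preference pair."""
    pairs = []
    for image, reference in dataset:
        candidates = [generate_report(image) for _ in range(n_samples)]
        scored = sorted(candidates, key=lambda c: green_score(c, reference))
        chosen, rejected = scored[-1], scored[0]  # highest vs. lowest score
        if chosen != rejected:  # skip degenerate pairs with identical text
            pairs.append({"image": image, "chosen": chosen, "rejected": rejected})
    return pairs
```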

10.

Methodology: 1. Baseline Models: CheXagent & CheXagent-2

Feature        | CheXagent              | CheXagent-2
Params         | 8B                     | 3B
Vision Encoder | EVA-CLIP-g             | Fine-tuned SigLIP
Language Model | Mistral-7B             | Phi-2
Memo           | SOTA model at the time | Smaller, yet more performant version

11.

Methodology: 2. Evaluation
• Radiology-Specific Metrics:
  • GREEN: LLM-based factuality metric for CXR reports (single-answer, reference-guided).
  • CheXbert Score: extracts 14 clinical labels (e.g., cardiomegaly, pneumonia) via the CheXbert labeler; measures clinical correctness.
• Reward Hacking Concern: models may "game" GREEN by generating excessively long reports.
• Mitigation: LC-GREEN (Length-Controlled GREEN):
  • LC-GREEN = GREEN / max(rel_verbosity, 1), where rel_verbosity = (candidate length) / (reference length).
  • Penalizes excessive verbosity, ensuring metric gains reflect true factual accuracy (see the sketch after this slide).
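A direct transcription of the LC-GREEN definition above into Python. The only assumption is that candidate and reference lengths are measured in the same unit (e.g., tokens or words).

```python
def lc_green(green_score, candidate_len, reference_len):
    """Length-controlled GREEN: LC-GREEN = GREEN / max(rel_verbosity, 1),
    with rel_verbosity = candidate length / reference length. Reports no
    longer than the reference keep their GREEN score; longer reports are
    divided by their relative verbosity."""
    rel_verbosity = candidate_len / reference_len
    return green_score / max(rel_verbosity, 1.0)

# A report twice as long as the reference has its score halved;
# a shorter-than-reference report is not rewarded for brevity.
print(lc_green(0.8, candidate_len=120, reference_len=60))  # -> 0.4
print(lc_green(0.8, candidate_len=50, reference_len=60))   # -> 0.8
```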

12.

Methodology: 2. Evaluation (GREEN)
• GREEN (Generative Radiology Report Evaluation and Error Notation): LLM-based factuality metric for CXR reports (single-answer, reference-guided).
• An LLM-based metric designed to evaluate the factual correctness of generated radiology reports against a reference report.
• It calculates a score from 0 to 1 by rewarding matched findings and heavily penalizing clinically significant errors.
• Provides not only a quantitative score but also a qualitative, human-readable summary explaining the specific errors found.

Sophie Ostmeier et al. 2024. GREEN: Generative Radiology Report Evaluation and Error Notation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 374–390, Miami, Florida, USA. Association for Computational Linguistics.
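As a rough illustration of how the description above ("reward matched findings, heavily penalize clinically significant errors, score between 0 and 1") could be aggregated from an LLM judge's counts, here is a toy scoring function. The exact formula is defined in Ostmeier et al. 2024 and may differ in detail; treat this as a sketch, not the metric itself.

```python
def green_style_score(matched_findings, significant_errors):
    """Toy aggregation in the spirit of GREEN: the fraction of matched
    findings among matched findings plus clinically significant errors,
    falling back to 0 when nothing was matched or counted."""
    total = matched_findings + significant_errors
    return matched_findings / total if total > 0 else 0.0

print(green_style_score(matched_findings=4, significant_errors=1))  # -> 0.8
```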

13.

Experiments & Results: Setup
• Base Models: CheXagent (8B) & CheXagent-2 (3B), SOTA open-source CXR VLMs with specialized vision encoders & LLMs.
• Datasets: MIMIC-CXR (148k training examples; an 80k subset for CheXagent), and the CheXpert Plus test set for generalization.

14.

Experiments & Results: Results
• Factual Accuracy ↑: the GREEN score improved by up to +57.4% vs. the SFT baseline.
• State-of-the-art CheXbert: new SOTA on MIMIC-CXR for clinical-label F1 accuracy.

15.

Experiments & Results: 1. Length Exploitation
• A positive correlation between average GREEN score and average report length for CheXagent.

16.

Experiments & Results: 1. Length Exploitation
• A manifestation of reward hacking via length exploitation.

17.

Experiments & Results: 2. Judge Optimization Results
• Boosts automated evaluation scores, increasing the GREEN score by up to 31.9%.

18.

Experiments & Results: 3. Generalization to CheXbert Scores
• The gains from judge optimization also generalize to CheXbert scores, reaching top CheXbert performance on MIMIC-CXR.

19.

Experiments & Results: 4. Alignment Tax Analysis
• The method substantially improves report quality without compromising performance on other image understanding tasks.

20.

Experiments & Results: Limitations
• Generality: needs testing on other VLM architectures & modalities.
• Hyperparameters: the search was not exhaustive; further tuning may yield better configurations.
• Scope of Methods: on-policy RLHF and human-in-the-loop approaches were not explored.
• Reference Quality: assumes high-quality reports; biases and gaps in the references may propagate.

21.

Conclusions
• CheXalign: an automated preference fine-tuning pipeline without human feedback, bridging AI alignment & medical imaging.
• Significant Gains: +57.4% GREEN, SOTA CheXbert, human-preferred outputs (62%).
• No Alignment Tax: maintained performance on auxiliary vision-language tasks.
• Toward Safer AI: a scalable path to improving the factuality & safety of medical VLMs.

22.

My View: Analysis
• ACL 2025 Papers
  • 1,700 papers were accepted as main conference papers.
  • 38 papers had titles that included at least one word related to the medical domain.
  • 188 papers were related to multi-modality (image, audio, speech).
  • 16 papers were related to both multi-modality and medical applications.
  • The radiology domain was the most frequently covered.

23.

My View: Analysis
• Benchmark
  • Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models
    • Constructs a comprehensive benchmark that evaluates models across two axes: 15 diverse medical specialties (e.g., cardiology, dermatology) and 8 core clinical capabilities (e.g., finding recognition, disease analysis, treatment planning). It uses human expert answers as a gold standard.
  • NGQA: A Nutritional Graph Question Answering Benchmark for Personalized Health-aware Nutritional Reasoning
• Radiology
  • CheXalign: Preference fine-tuning in chest X-ray interpretation models without human feedback
  • Online Iterative Self-Alignment for Radiology Report Generation
  • Automated Structured Radiology Report Generation
  • The Impact of Auxiliary Patient Data on Automated Chest X-Ray Report Generation and How to Incorporate It