Xingyu Fu

Xingyu Fu (府星妤)

Email: xingyufu@princeton.edu

👋 I am a Postdoctoral Fellow at Princeton University's PLI, working with Sanjeev Arora, Danqi Chen, and Zhuang Liu.

My research primarily focuses on generative multimodal models at the intersection of vision and natural language (e.g., multimodal LLMs, text-to-image/video generation, omni models). I aim to improve the perception and reasoning capabilities of multimodal models by bridging the two. I have built better evaluations for emergent abilities and used synthetic data to design models that can better perceive and reason about the multimodal world.

I did my Ph.D. in Computer Science at the University of Pennsylvania, advised by Prof. Dan Roth. During my Ph.D., I interned at Microsoft and AWS AI Labs. I received my B.S. in Computer Science from UIUC in 2020, where I was very fortunate to be advised by Prof. Jiawei Han.

I'm always open to collaborations. Send me an email if you're interested!

📑 Research Projects

ReFocus: Visual Editing as a Chain of Thought for Structured Image Understanding

Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, Cha Zhang

ICML 2025

[paper] [website] [code] [dataset] [twitter]

Science-T2I: Addressing Scientific Illusions in Image Synthesis

Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie

CVPR 2025

[paper] [website] [code] [dataset]

MUIRBENCH: A Comprehensive Benchmark for Robust Multi-image Understanding

Fei Wang*, Xingyu Fu*, James Y. Huang, Zekun Li, Qin Liu, Xiaogeng Liu, Mingyu Derek Ma, Nan Xu, Wenxuan Zhou, Kai Zhang, Tianyi Lorena Yan, Wenjie Jacky Mo, Hsiang-Hui Liu, Pan Lu, Chunyuan Li, Chaowei Xiao, Kai-Wei Chang, Dan Roth, Sheng Zhang, Hoifung Poon, Muhao Chen

ICLR 2025

[paper] [website] [code] [dataset] [twitter]

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Yushi Hu*, Weijia Shi*, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, Ranjay Krishna

NeurIPS 2024

[paper] [website] [code] [twitter]

Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Xingyu Fu, Muyu He, Yujie Lu, William Yang Wang, Dan Roth

COLM 2024

[paper] [website] [code] [dataset] [twitter]

BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu*, Yushi Hu*, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma†, Ranjay Krishna†

ECCV 2024, Spotlight of cVinW@CVPR 2024, 36K total downloads.

[paper] [website] [code] [dataset] [eval] [twitter] [Paper of the day]

Deceptive Semantic Shortcuts on Reasoning Chains: How Far Can Models Go without Hallucination?

Bangzheng Li, Ben Zhou, Fei Wang, Xingyu Fu, Dan Roth, Muhao Chen

NAACL. 2024.

[paper] [website] [code] [dataset]

ImagenHub: Standardizing the evaluation of conditional image generation models

Max Ku, Tianle Li, Kai Zhang, Yujie Lu, Xingyu Fu, Wenwen Zhuang, Wenhu Chen

ICLR. 2024.

[paper] [website] [code] [dataset] [visualization]

Generate then Select: Open-ended Visual Question Answering Guided by World Knowledge

Xingyu Fu, Sheng Zhang, Gukyeong Kwon, Pramuditha Perera, Henghui Zhu, Yuhao Zhang, Alexander Hanbo Li, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Dan Roth, Bing Xiang

ACL Findings. 2023.

[paper] [website] [code]

Dynamic Clue Bottlenecks: Towards Interpretable-by-Design Visual Question Answering

Xingyu Fu, Ben Zhou, Sihao Chen, Mark Yatskar, Dan Roth

arXiv. 2023.

[paper]

There's a Time and Place for Reasoning Beyond the Image

Xingyu Fu, Ben Zhou, Ishaan Chandratreya, Carl Vondrick, Dan Roth

ACL (Oral). 2022.

[paper] [code]

Design Challenges in Low-resource Cross-lingual Entity Linking

Xingyu Fu*, Weijia Shi*, Xiaodong Yu, Zian Zhao, Dan Roth

EMNLP. 2020.

[paper] [code]

Constrained sequence-to-sequence Semitic root extraction for enriching word embeddings

Ahmed El-Kishky*, Xingyu Fu*, Aseel Addawood, Nahil Sobh, Clare Voss, Jiawei Han

WANLP @ ACL. 2019.

[paper]
