ICPR 2026 · Lyon, France

Advancing Comprehensive Reasoning in Multimodal Large Language Models

From visual perception to complex multimodal reasoning: foundations, benchmarks, training strategies, and real-world applications.

August 21, 2026 · International Conference on Pattern Recognition

Reasoning beyond perception

Multimodal Large Language Models have rapidly evolved from perception-oriented vision-language systems to general-purpose agents capable of complex reasoning over images, videos, text, and structured knowledge. Despite impressive progress, comprehensive multimodal reasoning remains challenging: models must cope with visual ambiguity, compositional and temporal reasoning, hallucination, and the gap between language-based reasoning and grounded visual evidence.

This tutorial provides a systematic overview of reasoning capabilities in MLLMs, covering the transition from basic visual perception to complex inference. We will discuss how multimodal reasoning differs from text-only reasoning, key challenges in visual and vision-language reasoning, recent progress inspired by reasoning-focused LLMs, and practical strategies for building, evaluating, and deploying reasoning-capable MLLMs.

Core topics

01

Foundations of Multimodal Reasoning

How MLLMs connect visual perception, language understanding, commonsense reasoning, and symbolic abstraction.

02

Visual Reasoning Challenges

Grounding failures, hallucination, spatial-temporal reasoning, compositional inference, and robustness.

03

Reasoning-Centric Training and Evaluation

Instruction tuning, chain-of-thought-style supervision, data curation, benchmark design, and evaluation protocols.

04

Applications and Future Directions

MLLM agents, robotics, human motion understanding, scientific reasoning, and trustworthy multimodal AI.

Tutorial schedule

Schedule to be announced

The detailed tutorial program, including talk titles, speaker order, and session timing, will be updated after the final ICPR 2026 tutorial schedule is confirmed.

Speakers

Organizers

Jun Liu

Lancaster University

Professor / Chair in Digital Health at the School of Computing and Communications, with research interests in computer vision and human-centered AI.

Yiwei Wang

University of California, Merced

Assistant Professor in the Department of Computer Science and Director of the UC Merced NLP Lab, focusing on Natural Language Processing and Multimodal Large Language Models.

Yujun Cai

University of Queensland

Lecturer (Assistant Professor) at the School of Electrical Engineering and Computer Science, with research interests in multimodal understanding and trustworthy large models.

Junsong Yuan

University at Buffalo, SUNY

Professor and Director of the Visual Computing Lab at the Department of Computer Science and Engineering, focusing on computer vision and video understanding.

Resources

Slides will be released before the tutorial.