Foundations of Multimodal Reasoning
ICPR 2026 · Lyon, France
From visual perception to complex multimodal reasoning: foundations, benchmarks, training strategies, and real-world applications.
Overview
Multimodal Large Language Models (MLLMs) have rapidly evolved from perception-oriented vision-language systems to general-purpose agents capable of complex reasoning over images, videos, text, and structured knowledge. Despite this progress, comprehensive multimodal reasoning remains challenging because of visual ambiguity, hallucination, the difficulty of compositional and temporal reasoning, and the gap between language-based reasoning and grounded visual evidence.
This tutorial provides a systematic overview of reasoning capabilities in MLLMs, covering the transition from basic visual perception to complex inference. We will discuss how multimodal reasoning differs from text-only reasoning, key challenges in visual and visual-language reasoning, recent progress inspired by reasoning-focused LLMs, and practical strategies for building, evaluating, and deploying reasoning-capable MLLMs.
What you will learn
How MLLMs connect visual perception, language understanding, commonsense reasoning, and symbolic abstraction.
Grounding failures, hallucination, spatial-temporal reasoning, compositional inference, and robustness.
Instruction tuning, chain-of-thought-style supervision, data curation, benchmark design, and evaluation protocols.
MLLM agents, robotics, human motion understanding, scientific reasoning, and trustworthy multimodal AI.
Program
The detailed tutorial program, including talk titles, speaker order, and session timing, will be updated after the final ICPR 2026 tutorial schedule is confirmed.
Invited speakers
Queen Mary University of London
Dr. Ziquan Liu is a Lecturer / Assistant Professor at the School of Electronic Engineering and Computer Science, Queen Mary University of London. He is affiliated with the Centre for Multimodal AI and the Computer Vision Group. His research focuses on reliable and responsible machine learning, trustworthy AI, and multimodal foundation models.
Homepage
Hohai University
Dr. Yirui Wu is a Young Professor at Hohai University, a member of the Hydrology Big Data Group, and the leader of Delta Lab. His research interests include computer vision, artificial intelligence, multimedia computing, and intelligent water conservancy.
Homepage
KTH Royal Institute of Technology
Dr. Siyuan Yang is currently a Wallenberg–NTU Presidential Postdoctoral Fellow in the Department of Robotics, Perception and Learning at KTH Royal Institute of Technology. His research interests include computer vision, action recognition, and human pose estimation.
Tutorial organizers
Lancaster University
Professor / Chair in Digital Health at the School of Computing and Communications, with research interests in computer vision and human-centered AI.
Homepage
University of California, Merced
Assistant Professor in the Department of Computer Science and Director of the UC Merced NLP Lab, focusing on Natural Language Processing and Multimodal Large Language Models.
Homepage
University of Queensland
Lecturer (Assistant Professor) at the School of Electrical Engineering and Computer Science, with research interests in multi-modal understanding and trustworthy large models.
Homepage
University at Buffalo, SUNY
Professor and Director of the Visual Computing Lab at the Department of Computer Science and Engineering, focusing on computer vision and video understanding.
Homepage
Materials
Slides will be released before the tutorial.