Spatial Interpretability of Multimodal Data
Spatial Grounding Mechanisms and Referring Alignment in Multimodal Large Models
This line of work examines how multimodal large language models (MLLMs) achieve fine-grained visual grounding and logical understanding of 2D/3D spatial features, for example by improving tokenization, introducing coordinate regression, or decoupling perception from reasoning.
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding(Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- START: Spatial and Textual Learning for Chart Understanding(Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu, 2025, ArXiv)
- Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations(Yizhe Li, Dell Zhang, Xuelong Li, Yiqing Shen, 2025, ArXiv)
- Query-Guided Spatial Localization with Multimodal Large Language Models(Zhihan Zhang, Tianle Hu, Dong Yin, 2025, Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing)
- Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks(Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Grounding Everything in Tokens for Multimodal Large Language Models(Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma, 2025, ArXiv)
- HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model(Chen Li, Eric Peh, Basura Fernando, 2025, ArXiv)
- SOLIDGEO: Measuring Multimodal Spatial Math Reasoning in Solid Geometry(Peijie Wang, Chao Yang, Zhongzhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhi-Long Ji, Jinfeng Bai, Chenglin Liu, 2025, ArXiv)
- LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study(Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo, 2025, ArXiv)
- Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding(Yutong Zhong, 2025, ArXiv)
- Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage(Junfei Xie, Peng Pan, Xulong Zhang, 2026, ArXiv)
- Injecting Cross-modal Fine-Grained Perception into LLMs for 3D Object-of-Interest Understanding(Qianqian Sun, Lu Shi, Linna Zhang, Gaoyun An, Yi Jin, Yidong Li, Yigang Cen, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- Enhancing Spatial Reasoning in Multimodal Vision-Language Models via Depth-Aware Feature Integration(Hiroo Tsuji, 2025, 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD))
Theoretical Characterization of Latent-Space Geometric Manifolds and Cross-Modal Consistency
Focuses on the topological structure of multimodal data in latent space, using manifold learning, geometric calibration, and contrastive learning to construct shared, interpretable embedding spaces, with the goal of mathematically explaining how the semantic gap between modalities is bridged.
- Calibrated Similarity for Reliable Geometric Analysis of Embedding Spaces(Nicolas Tacheny, 2026, ArXiv)
- REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model(Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qing Xia Zhao, Linqi Song, Lijie Wen, 2025, ArXiv)
- Event2Vec: A Geometric Approach to Learning Composable Representations of Event Sequences(Antonin Sulc, 2025, ArXiv)
- Analytical Discovery of Manifold with Machine Learning(Yafei Shen, Huan-Fei Ma, Ling Yang, 2025, ArXiv)
- A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models(Leah Bar, L. Yosef, Shai Zucker, N. Shoham, Inbar Seroussi, N. Sochen, 2025, ArXiv)
- scMAG: Integrating single-cell multi-omics data via multi-stage deep fusion with manifold-aware gating.(Shuangquan Li, Junhao Zou, 2026, Computational biology and chemistry)
- SimE: A Knowledge Graph Embedding Model to Encode Self-Similar Structures Through Algebraic and Geometric Transformations(K. Amouzouvi, Yasharajsinh Chudasama, Disha Purohit, Ariam Rivas, Bowen Song, Jens Lehmann, Sahar Vahdati, Maria-Esther Vidal, 2025, IEEE Access)
- Integrating Large Language Models and Möbius Group Transformations for Temporal Knowledge Graph Embedding on the Riemann Sphere(Sensen Zhang, Xun Liang, Simin Niu, Zhendong Niu, Bo Wu, Gengxin Hua, Longzheng Wang, Zhenyu Guan, Hanyu Wang, Xuan Zhang, Zhiyu Li, Yuefeng Ma, 2025, No journal)
- JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory(Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang, 2025, ArXiv)
- High-dimensional multimodal uncertainty estimation by manifold alignment:Application to 3D right ventricular strain computations(Maxime Di Folco, Gabriel Bernardino, Patrick Clarysse, Nicolas Duchateau, 2025, ArXiv)
- Cross-Modal Retrieval via Contrastive Representation Learning of Images and Text Descriptions(Zhaoxuan Li, Nan Tang, 2025, Int. J. Pattern Recognit. Artif. Intell.)
- Multi-Semantic Embedding Hashing for LargeScale Cross-Modal Retrieval(Zhiying Cui, Hongbin Ma, Yingli Wang, 2025, 2025 4th International Joint Conference on Information and Communication Engineering (JCICE))
- Intramodal consistency in triplet-based cross-modal learning for image retrieval(Mario Mallea, Ricardo Ñanculef, Mauricio Araya, 2025, Machine Learning)
- Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception(Alexandros Christoforos, S. Jenkins, Michael Brown, Tuan Pham, David L. Chen, 2026, ArXiv)
- Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations(Yilun Kuang, Yash Dagade, Tim G. J. Rudner, Randall Balestriero, Yann LeCun, 2026, ArXiv)
- Collaboratively Semantic Alignment and Metric Learning for Cross-Modal Hashing(Jiaxing Li, W. Wong, Lin Jiang, Kaihang Jiang, Xiaozhao Fang, Shengli Xie, Jie Wen, 2025, IEEE Transactions on Knowledge and Data Engineering)
- Calibrated Multimodal Representation Learning with Missing Modalities(Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-Kiong Ng, Tat-Seng Chua, 2025, ArXiv)
- CMLFA: cross-modal latent feature aligning for text-to-image person re-identification(Xiaofa Yang, Jianming Wang, Yukuan Sun, Xiaojie Duan, 2025, Journal of Electronic Imaging)
- Interpreting the Linear Structure of Vision-language Model Embedding Spaces(Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, S. Kakade, Stephanie Gil, 2025, ArXiv)
- Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency(Yanbiao Ma, Wei Dai, Bo Liu, Jiayi Chen, Wenke Huang, Guancheng Wan, Zhiwu Lu, Junchi Yan, 2025, ArXiv)
3D Scene Perception and Navigation for Embodied AI and Autonomous Driving
Emphasizes modeling of physical space in dynamic environments, using 3D point clouds, Gaussian Splatting, and bird's-eye-view (BEV) representations to achieve cross-sensor spatio-temporal synchronization, obstacle avoidance, and trajectory prediction.
- GARNET: Gaussian Feature Rendering Network for 3D Object Classification(Lingfan Zheng, Yifan Liu, Zhen Xiao, Jianbin Jiao, Yanzhao Zhou, 2025, 2025 4th International Conference on Image Processing, Computer Vision and Machine Learning (ICICML))
- CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence(Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou, 2025, ArXiv)
- AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models(Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, Na Zhao, 2025, ArXiv)
- SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models(Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei, 2025, ArXiv)
- HSTI: A Light Hierarchical Spatial-Temporal Interaction Model for Map-Free Trajectory Prediction(Xiaoyang Luo, Shuaiqi Fu, Bolin Gao, Yanan Zhao, Huachun Tan, Ze Song, 2025, IEEE Transactions on Intelligent Transportation Systems)
- SpatiaLoc: Leveraging Multi-Level Spatial Enhanced Descriptors for Cross-Modal Localization(Tianyi Shang, Pengjie Xu, Zhaojun Deng, Zhenyu Li, Zhicong Chen, Lijun Wu, 2026, ArXiv)
- Spatiotemporal Graph Networks for Relational Reasoning in Campus Infrastructure Management(Sanjay Agal, Krishna M Raulji, Nikunj Bhavsar, Pooja Bhatt, 2025, International Journal of Advanced Computer Science and Applications)
- MDNet: Multimodal Cooperative Perception via Spatial Alignment of Modal Decision-Making(Junyang He, Xiaoheng Deng, Jinsong Gui, Tao Zhang, Xiangjian He, 2025, IEEE Internet of Things Journal)
- DQTP: A Robot Autonomous Task Planner in Open Environments Based on Qwen2-VL model(Yuanjin Qu, Xiangtao Hu, 2025, Proceedings of the 2025 2nd International Conference on Industrial Automation and Robotics)
- BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird's-Eye View with Deformable Attention and Sparse Goal Proposals(Minsang Kong, Myeongjun Kim, Sang Gu Kang, Sang Hun Lee, 2025, ArXiv)
- Multimodal sensor fusion with cross-modal alignment and attention mechanism for enhanced object detection in autonomous driving systems(Piaopiao Qin, Qien Gao, Feng Jiang, Hongjian Zhang, Yi Huang, 2025, No journal)
- BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving(Karthik Mohan, Sonam Singh, Amit Arvind Kale, 2025, ArXiv)
- Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes(Shai Krakovsky, Gal Fiebelman, Sagie Benaim, Hadar Averbuch-Elor, 2025, Proceedings of the SIGGRAPH Asia 2025 Conference Papers)
- PolarGFusion3D: Polar Graph Fusion Network for Enhanced Multimodal 3D Perception in Intelligent Vehicles(Lu Li, Chao Wei, 2025, IEEE Transactions on Intelligent Vehicles)
- Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System(Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li, 2025, ArXiv)
- Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models(Kimia Ehsani, Walid Saad, 2025, ArXiv)
- Integrated Multimodal Perception and Predictive Motion Forecasting via Cross-Modal Adaptive Attention(Bakhita Salman, Alexander Chávez, Muneeb Yassin, 2026, Future Transportation)
- DCI-PRNet: 3D Object Detection Network via Dual Cross-modal Interaction and Progressive Reasoning(Sixian Chan, Beibei Duan, Xinggang Fan, Jie Hu, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- HGACNet: Hierarchical Graph Attention Network for Cross-Modal Point Cloud Completion(Yadan Zeng, Jiadong Zhou, Xiaohan Li, I-Ming Chen, 2025, ArXiv)
Spatial-Topological Association Analysis of Medical Images and Physiological Signals
Studies how to preserve anatomical consistency, using cross-scale alignment (e.g., histology with spatial transcriptomics) and spatio-temporal feature fusion to improve the interpretability of clinical tasks such as lesion detection and brain-function analysis.
- VAMF-Net: multimodal fusion and multiscale attention for 3D brain tumor segmentation(Tiansong Sheng, Beibei Hou, 2026, No journal)
- SCDM: Unified Representation Learning for EEG-to-fNIRS Cross-Modal Generation in MI-BCIs(Yisheng Li, Yishan Wang, Baiying Lei, Shuqiang Wang, 2025, IEEE Transactions on Medical Imaging)
- Multimodal deep learning with anatomically constrained attention for screening MRI-detectable TMJ abnormalities from panoramic images(Hyo-Jung Jung, Dayun Ju, Chanyoung Kim, Seong Jae Hwang, Chena Lee, Younjung Park, 2026, NPJ Digital Medicine)
- Hybrid CNN-Graph Attention Networks for Diabetic Retinopathy Grading: A Multimodal Feature Fusion Approach(Vamshi Krishna Pandugula, Abhishek Choudhary, Ravi Uyyala, Padmavathi Vurubindi, 2025, 2025 3rd International Conference on Inventive Computing and Informatics (ICICI))
- Cross-modal dual-domain bi-direction feature interaction network for medical imaging semantic segmentation(Tao Zhou, Qitao Liu, Ke Song, Wenwen Chai, Kaixiong Chen, Huiling Lu, 2025, Scientific Reports)
- DSMFF-UNet: A dual-stream U-Net network based on multimodal feature fusion for EEG depression recognition(Yitong Li, Lu Yuan, 2025, 2025 International Conference on Signal Processing, Computer Networks and Communications (SPCNC))
- Fusion Analysis of EEG-fNIRS Multimodal Brain Signals: A Multitask Classification Algorithm Incorporating Spatial-Temporal Convolution and Dual Attention Mechanisms(Xingbin Shi, Haiyan Wang, Baojiang Li, Yuxin Qin, Cheng Peng, Yifan Lu, 2025, IEEE Transactions on Instrumentation and Measurement)
- StackTrans–Multimodal Heart Disease Detection Using Stacked Transformer Fusion Framework(Muhammad Adnan, Yang Yi, Enci Wang, Md Nasir Imtiaz, 2025, IEEE Access)
- CS2former: Multimodal feature fusion transformer with dual channel-spatial feature extraction module for bipolar disorder diagnosis(Guoxin Wang, Fengmei Fan, Shipeng Dai, Shan An, Chao Zhang, Sheng Shi, Yunan Mei, Feng Yu, Qi Wang, Xiaole Han, Shuping Tan, Yunlong Tan, Zhiren Wang, 2025, Computerized medical imaging and graphics : the official journal of the Computerized Medical Imaging Society)
- MDMF-Net: Multi-Dimensional Integrated Multimodal Feature Fusion Alzheimer's Disease Prediction Network*(Jiahao Mei, Yuhang Peng, Zicheng Zhang, Huabin Wang, 2025, 2025 10th International Conference on Signal and Image Processing (ICSIP))
- CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation(Xinlei Yu, Changmiao Wang, Hui Jin, Ahmed Elazab, Gangyong Jia, Xiang Wan, Changqing Zou, Ruiquan Ge, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- RTGMFF: Enhanced fMRI-Based Brain Disorder Diagnosis via ROI-Driven Text Generation and Multimodal Feature Fusion(Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min, 2025, 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM))
- Predicting fine-grained cell types from histology images through cross-modal learning in spatial transcriptomics(Chaoyang Yan, Zhihan Ruan, Song Chen, Yichen Pan, Xue Han, Yuanyu Li, Jian Liu, 2025, Bioinformatics)
- Integrating histology and spatial transcriptomics via multimodal transformers and contrastive representation learning for accurate gene expression prediction.(Kai Wang, Li Shi, Xue Li, Wei Li, Bin Wang, Shihua Zhou, Ben Cao, Pan Zheng, 2026, Journal of biomedical informatics)
- MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation(Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning (Raymond) Ning, Wei Li, Lihao Liu, Qiushan Guo, Tian-Xin Li, Junjun He, Hongming Shan, 2025, ArXiv)
- MedXAI-MM: A Unified Multi-Modality Explainable Artificial Intelligence Framework for Clinical Medical Imaging(Olfa Ghribi, M. Kharrat, Mohamed Chaabane, 2025, 2025 IEEE 22nd International Conference on Sciences and Techniques of Automatic Control and Computer Engineering (STA))
- Detection of Retinal Dysfunction with Multimodal PERG Analysis: A Patient-Level Hybrid Machine Learning Framework(Yavuz Bahadır Koca, 2026, Engineering Perspective)
- Multimodal meta-learning for lung nodule classification under few-shot settings: a trustworthy AI framework with 2D–3D cross-modal alignment(Juhi Gupta, Monica Mehrotra, Arpita Aggarwal, 2026, Pattern Analysis and Applications)
Heterogeneous Feature Fusion and Change Detection for Remote Sensing and Geospatial Data
Targets satellite, SAR, and optical remote sensing data, using attention mechanisms and geometric constraints to overcome mismatched spatial resolutions and viewing-angle bias, enabling accurate land-cover classification and geographic information interpretation.
- KOMPSAT-3/3A Image-text Dataset for Training Large Multimodal Models(Han Oh, Donghyun Shin, Daewon Chung, 2025, GEO DATA)
- Hyperspectral Unmixing Based on Dual-Graph Manifold Regularization: Joint Preservation of Spatial-Spectral Geometric Structure(Xiaojuan Luo, Kewen Qu, 2025, 2025 6th International Conference on Geology, Mapping and Remote Sensing (ICGMRS))
- Dual Feature Enhancement and Adaptive Attention Fusion for Cross-Modal Scene Classification of Mining Land(Yue Zhou, Jiangyuan Wang, Xianju Li, 2025, 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- A Multimodal Semantic Segmentation Framework for Heterogeneous Optical and Complex SAR Data(Sining Xiao, Peijin Wang, Wenhui Diao, Kun Fu, Xian Sun, 2025, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
- Cross-modal feature interaction network for heterogeneous change detection(Zhiwei Yang, Xiaoqin Wang, Haihan Lin, Mengmeng Li, Mengjing Lin, 2025, Geo-spatial Information Science)
- Multimodal Feature-Enhanced Unet for Forward-Looking Sonar Segmentation(Zefan Wu, Wei Li, Xiaoguang Chen, Lin Mei, Ye-Qiong Wang, 2025, 2025 IEEE 102nd Vehicular Technology Conference (VTC2025-Fall))
- DF2RQ: Dynamic Feature Fusion via Region-Wise Queries for Semantic Segmentation of Multimodal Remote Sensing Data(Shiyang Feng, Zhaowei Li, Bo Zhang, Bin Wang, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- A Vision Centric Remote Sensing Benchmark(Abduljaleel Adejumo, Faegheh Yeganli, Clifford Broni-Bediako, Aoran Xiao, Naoto Yokoya, Mennatullah Siam, 2025, ArXiv)
- Cross-Modal Contrastive Pansharpening via Uncertainty Guidance(Haoying Zeng, Xiaoyuan Yang, Kangqing Shen, Yixiao Li, Jin Jiang, Fangyi Li, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- Robust Multimodal Road Extraction via Dual-Layer Evidential Fusion Networks for Remote Sensing(Hui Wang, You-Sun Huang, Yu Wang, Donglai Jiao, Hao Huang, Yun Lin, Guan Gui, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- A Mamba-Aware Spatial–Spectral Cross-Modal Network for Remote Sensing Classification(Mengru Ma, Jiaxuan Zhao, Wenping Ma, Licheng Jiao, Lingling Li, Xu Liu, Fang Liu, Shuyuan Yang, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- SCIAU-Net: A Spatial-Spectral Cross-Modal Interaction ADMM Unfolding Network for Hyperspectral and Multispectral Image Fusion(Ruiqing Zhang, Bingbing Lei, Wei Feng, X. Chai, 2026, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
- Joint Classification of Hyperspectral and LiDAR Data Using Hierarchical Multimodal Feature Aggregation-Based Multihead Axial Attention Transformer(Fei Zhu, Cuiping Shi, Kaijie Shi, Liguo Wang, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- Spatial Uncertainty Quantification in Wildfire Forecasting for Climate-Resilient Emergency Planning(A. Chakravarty, 2025, ArXiv)
- AUTOMATIC SEMANTIC SEGMENTATION OF SENTINEL-2 IMAGES: INTEGRATION OF CLUSTERING AND LARGE MULTIMODAL MODELS FOR CLUSTER INTERPRETATION(O. Honcharov, V. Hnatushenko, 2025, International scientific and technical conference Information technologies in metallurgy and machine building)
Spatial Layout Control and Spatio-Temporal Consistency in Generative Models
Studies how diffusion models and related architectures use structured instructions, spatial sketches, or geometric guidance to keep generated images and videos consistent with physical common sense in layout and motion.
- UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions(Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang, 2025, ArXiv)
- Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency(Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shao-hua Xu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang, 2025, ArXiv)
- Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization(Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal, 2025, ArXiv)
- Enhancing Text-to-SVG Generation via Structured Instruction Embedding and Syntax-Aware Reinforcement in Large Language Models(Peiqing Lu, Shihao Zhao, Yushang Zhao, Runmian Chang, Yinuo Yang, 2025, 2025 8th International Conference on Computer Information Science and Application Technology (CISAT))
- Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers(Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong, 2025, ArXiv)
- Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control(Danfeng li, Hui Zhang, Shenghong Wang, Jiachen Li, Zuxuan Wu, 2025, ArXiv)
- In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation(Yu Xu, Fan Tang, You Wu, Lin Gao, Oliver Deussen, Hongbin Yan, Jintao Li, Juan Cao, Tong-Yee Lee, 2025, Proceedings of the SIGGRAPH Asia 2025 Conference Papers)
- Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion(Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang, 2026, ArXiv)
- Improving Classifier-Free Guidance of Flow Matching via Manifold Projection(Jian-feng Cai, Haixia Liu, Zhe Su, Chao Wang, 2026, ArXiv)
Spatial Pattern Mining for Behavior Analysis, Industrial Monitoring, and Task-Specific Applications
Covers human action recognition, micro-expression analysis, industrial defect detection, and similar tasks, using spatio-temporal graph convolutions (ST-GCN) or multi-scale interaction to capture anomalous features and semantic associations in dynamic environments.
- Real-Time Fall Detection via Spatio-Temporal Collaborative Attention and Multimodal Feature Fusion Based on Deep Learning(Yin-zu Chen, 2025, 2025 6th International Conference on Computer Vision, Image and Deep Learning (CVIDL))
- MAPLE: Modality-Agnostic Prototype Learning for Egocentric Action Recognition(Da Li, Di Zhou, Yishan Zou, Shenghua Li, Meng Liu, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- A Multimodal Feature Fusion Based Action Normality Matching Algorithm for Online Sports Education(Jie Liu, Fengyang Fu, 2025, 2025 International Conference on Educational Technology Management (ICETM))
- LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs(Hanyu Zhou, Gim Hee Lee, 2025, ArXiv)
- Unified Visual Synchrony: A Framework for Face–Gesture Coherence in Multimodal Human–AI Interaction(Saule Kudubayeva, Yernar Seksenbayev, A. Yerimbetova, E. Daiyrbayeva, B. Sakenov, Duman Telman, M. Turdalyuly, 2026, Big Data and Cognitive Computing)
- Real-Time Personnel Behavior Detection in Dusty Coal Mines via Dehazing-Enhanced YOLO with Cross-Modal Guidance(Meng Zhou, C. Qin, 2025, Academic Journal of Computing & Information Science)
- Correlation-Driven Multi-Level Multimodal Learning for Anomaly Detection on Smart Electric Grid(Yuhan Dong, 2025, 2025 2nd International Conference on Smart Grid and Artificial Intelligence (SGAI))
- MFCNet: Multimodal Feature Fusion Network for RGB-T Vehicle Density Estimation(Ling-Xiao Qin, Hong-mei Sun, Xiao-Meng Duan, Cheng-Yue Che, Ruisheng Jia, 2025, IEEE Internet of Things Journal)
- Multimodal Industrial Anomaly Detection via Uni-Modal and Cross-Modal Fusion(Hao Cheng, Jiaxiang Luo, Xianyong Zhang, 2025, IEEE Transactions on Industrial Informatics)
- Abnormal behavior detection method based on multimodal feature fusion with attention mechanism(Yuexia Liu, Yunfei Cheng, Wu Wang, 2025, No journal)
- A Multimodal Gait Recognition Method Based on Skeleton Maps and Channel-Prior Convolutional Attention(Dongliang Yang, Changjiang Song, Siwen Sun, 2025, Proceedings of the 2025 International Conference on Computer Technology, Digital Media and Communication)
- SwinET-IoT: A Mask-Guided Multimodal Transformer Framework for Real-Time Emotion Prediction in Intelligent Learning Environments(D. P, G.Thailambal, 2026, 2026 International Conference on Electronics and Renewable Systems (ICEARS))
Benchmarks for Spatial Interpretability and Diagnosis of Model Internals
Develops benchmarks dedicated to 3D/6D spatial reasoning, hallucination mitigation, and cross-modal alignment, and uses tools such as sparse autoencoders (SAEs) to probe the physical meaning of internal model features (a minimal SAE sketch follows the paper list below).
- Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models(Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space(Weichen Zhan, Zile Zhou, Zhiheng Zheng, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, Xiao-Ping Zhang, 2025, ArXiv)
- Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration(Younan Zhu, Linwei Tao, Minjing Dong, Chang Xu, 2025, ArXiv)
- SAE-V: Interpreting Multimodal Models for Enhanced Alignment(Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang, 2025, ArXiv)
- Explaining multimodal LLMs via intra-modal token interactions(Jiawei Liang, Ruoyu Chen, Xianghao Jiao, Siyuan Liang, Shiming Liu, Qunli Zhang, Zheng Hu, Xiaochun Cao, 2025, ArXiv)
- HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models(Sushant Gautam, Michael A. Riegler, Paal Halvorsen, 2025, ArXiv)
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation(Jingwei Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi, 2026, ArXiv)
- Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models(Tan-Hanh Pham, Chris Ngo, 2025, ArXiv)
- Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?(Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, D. Paudel, L. V. Gool, Kailun Yang, Xuming Hu, 2025, ArXiv)
- MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning(Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor, 2025, ArXiv)
- AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning(Kaixuan Wu, Xinde Li, Xinling Li, Chuanfei Hu, Guoliang Wu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning(Siqu Ou, Hongcheng Liu, Pingjie Wang, Yusheng Liao, Chuan Xuan, Yanfeng Wang, Yu Wang, 2025, No journal)
- Visual Representation Alignment for Multimodal Large Language Models(Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sung‐Jin Hong, Seungryong Kim, 2025, ArXiv)
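To make the SAE-based probing mentioned in this group's description concrete, below is a minimal sketch (assuming PyTorch and illustrative dimensions) of a sparse autoencoder that could be fit to activations collected from a multimodal model; it is not any particular paper's implementation.

```python
# A minimal sparse-autoencoder sketch: a single-hidden-layer autoencoder with an L1
# sparsity penalty is fit to model activations, and the resulting sparse features can
# then be inspected for interpretable (e.g., spatial) meaning. Dimensions and the
# sparsity weight are illustrative assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature codes
        return self.decoder(features), features


def sae_loss(recon, target, features, l1_weight: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature codes to zero.
    return ((recon - target) ** 2).mean() + l1_weight * features.abs().mean()


if __name__ == "__main__":
    sae = SparseAutoencoder()
    acts = torch.randn(64, 768)          # stand-in for collected model activations
    recon, feats = sae(acts)
    loss = sae_loss(recon, acts, feats)
    loss.backward()
    print(loss.item(), (feats > 0).float().mean().item())  # loss and feature density
```

The L1 term keeps most feature activations at zero, so each surviving feature can be examined individually, for instance by checking which image regions or spatial relations most strongly activate it.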
The final grouping divides research on the spatial interpretability of multimodal data into eight dimensions. The core research paths trace an evolution from low-level "geometric manifolds and latent-space alignment", through mid-level "spatial pattern extraction in vertical domains (medicine, remote sensing, perception)", to high-level "spatial reasoning evaluation and generation control for large models". The research focus has shifted from simple multimodal feature fusion toward mechanistic interpretation of how models process space internally (e.g., SAE-based analysis), and toward maintaining spatio-temporal consistency and robustness in physically interactive settings such as embodied AI and autonomous driving.
A total of 206 related papers were surveyed. Abstracts of selected papers follow.
Multimodal large language models have achieved remarkable progress in tasks such as visual understanding, captioning, and reasoning, demonstrating their strong ability to bridge visual and textual modalities. However, spatial localization remains a highly challenging task. Existing approaches typically rely on directly predicting spatial coordinates from large models; however, these numerical outputs lack semantic interpretability and provide little information about how the model connects language to specific regions in the visual input. Moreover, when extending from static images to dynamic videos, the number of predicted spatial coordinates grows rapidly across frames, making temporal alignment with video content difficult. In addition, the large volume of coordinate outputs leads to inefficiency in inference, which significantly limits the applicability of current methods to long or high-resolution videos. To solve the mentioned issues, we design a query-guided spatial localization baseline based on large multimodal models. The key idea is to move away from treating localization as direct coordinate regression and instead leverage semantically meaningful queries to guide the localization process. Specifically, we design spatial-aware queries that capture frame-level spatial cues, and we introduce a query-guided decoder that maps hidden representations of large multimodal models into spatial coordinates. This design not only enables more interpretable localization but also facilitates temporal alignment in videos by associating queries with corresponding frames. Furthermore, it reduces the computational burden by avoiding dense coordinate prediction for every frame. Extensive experiments on both Referring Expression Comprehension and video spatial localization benchmarks demonstrate that our method achieves superior performance compared to state-of-the-art baselines.
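To illustrate the query-guided idea described in this abstract, here is a minimal PyTorch sketch in which learnable spatial-aware queries cross-attend to the hidden states of a multimodal LLM and are decoded into normalized boxes, instead of having the model emit coordinate tokens directly. Module names, dimensions, and the number of queries are illustrative assumptions, not the paper's implementation.

```python
# Query-guided box decoding: learnable queries attend over LMM hidden states and a
# small head maps each attended query to a normalized bounding box.
import torch
import torch.nn as nn


class QueryGuidedBoxDecoder(nn.Module):
    def __init__(self, hidden_dim: int = 1024, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        # One learnable query per frame (or per referred object) captures spatial cues.
        self.spatial_queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Map each attended query to a normalized box (cx, cy, w, h) in [0, 1].
        self.box_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 4)
        )

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, seq_len, hidden_dim) hidden states from the multimodal LLM.
        batch = llm_hidden.size(0)
        queries = self.spatial_queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(queries, llm_hidden, llm_hidden)
        return self.box_head(attended).sigmoid()  # (batch, num_queries, 4)


if __name__ == "__main__":
    decoder = QueryGuidedBoxDecoder()
    fake_hidden = torch.randn(2, 256, 1024)   # stand-in for LMM hidden states
    print(decoder(fake_hidden).shape)         # torch.Size([2, 8, 4])
```

Because each query is tied to a frame or object rather than to a stream of coordinate tokens, the number of decoded outputs stays fixed per frame, which is what keeps inference cost manageable for video.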
Vision Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. We study why VLMs, such as LLaVA, underutilize spatial cues despite having positional encodings and spatially rich vision encoder features. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text tokens, suppressing LLM's position embedding. To expose this mechanism, we developed three interpretability tools: (1) the Position Sensitivity Index, which quantifies reliance on token order, (2) the Cross Modality Balance, which reveals attention head allocation patterns, and (3) a RoPE Sensitivity probe, which measures dependence on rotary positional embeddings. These tools uncover that vision tokens and system prompts dominate attention. We validated our mechanistic understanding through targeted interventions that predictably restore positional sensitivity. These findings reveal previously unknown failure modes in multimodal attention and demonstrate how interpretability analysis can guide principled improvements.
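A small diagnostic sketch in the spirit of the analysis above: it compares the mean L2 norm of vision-token embeddings against text-token embeddings, and probes position sensitivity by shuffling token order and measuring how much a pooled representation changes. The tensors, sizes, and the mean-pooling probe are illustrative assumptions rather than the paper's exact Position Sensitivity Index.

```python
# Two lightweight probes: a modality norm ratio and a token-order sensitivity score.
import torch


def modality_norm_ratio(vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> float:
    # Each input: (num_tokens, hidden_dim). A ratio much larger than 1 suggests that
    # vision tokens may dominate attention and drown out positional information.
    return (vision_tokens.norm(dim=-1).mean() / text_tokens.norm(dim=-1).mean()).item()


def position_sensitivity(encode, tokens: torch.Tensor) -> float:
    # 'encode' is any callable mapping (seq_len, hidden_dim) -> pooled vector.
    # Compare the output on the original vs. a randomly permuted sequence;
    # values near zero indicate the encoder ignores token order.
    baseline = encode(tokens)
    shuffled = encode(tokens[torch.randperm(tokens.size(0))])
    return ((baseline - shuffled).norm() / (baseline.norm() + 1e-8)).item()


if __name__ == "__main__":
    vision = torch.randn(576, 1024) * 5.0    # vision tokens with an inflated scale
    text = torch.randn(32, 1024)
    print("norm ratio:", modality_norm_ratio(vision, text))
    pool = lambda x: x.mean(dim=0)           # order-invariant pooling, sensitivity ~0
    print("position sensitivity:", position_sensitivity(pool, torch.randn(64, 1024)))
```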
This study aims to improve the accuracy and interpretability of large multimodal models (LMMs) specialized in satellite image analysis by constructing an image-text dataset based on KOMPSAT-3/3A imagery and presenting the results of training using this dataset. Conventional LMMs are primarily trained on general images, limiting their ability to effectively interpret the specific characteristics of satellite imagery, such as spectral bands, spatial resolution, and viewing angles. To address this limitation, we developed an image-text dataset, divided into pretraining and finetuning stages, based on the existing KOMPSAT object detection dataset. The pretraining dataset consists of captions summarizing the overall theme and key information of each image. The fine-tuning dataset integrates metadata -including acquisition time, sensor type, and coordinates-with detailed object detection labels to generate six types of question-answer pairs: detailed descriptions, conversations with varying answer lengths, bounding box identification, multiple choice questions, and complex reasoning. This structured dataset enables the model to learn not only the general context of satellite images but also fine-grained details such as object quantity, location, and geographic attributes. Training with the new KOMPSAT-based dataset significantly improved the model’s accuracy in recognizing regional information and object characteristics in satellite imagery. Finetuned models achieved substantially higher accuracy than previous models, surpassing even the GPT-4o model and demonstrating the effectiveness of a domain-specific dataset. The findings of this study are expected to contribute to various remote sensing applications, including automated satellite image analysis, change detection, and object detection.
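As a rough illustration of the fine-tuning data construction described above, the sketch below turns object-detection labels and scene metadata into a few question-answer pair types (counting, bounding-box identification, metadata-grounded description). Field names and templates are hypothetical and do not reflect the released dataset schema.

```python
# Build simple QA pairs from detection labels and acquisition metadata.
from collections import Counter


def build_qa_pairs(metadata: dict, objects: list) -> list:
    counts = Counter(obj["category"] for obj in objects)
    qa = []
    # Counting questions.
    for category, n in counts.items():
        qa.append({"question": f"How many {category}s are visible in this image?",
                   "answer": str(n)})
    # Bounding-box identification questions.
    for obj in objects:
        qa.append({"question": f"Where is the {obj['category']} located?",
                   "answer": f"Bounding box: {obj['bbox']}"})
    # Metadata-grounded description.
    qa.append({"question": "Describe when and how this image was acquired.",
               "answer": f"Acquired by {metadata['sensor']} on {metadata['time']} "
                         f"near {metadata['coordinates']}."})
    return qa


if __name__ == "__main__":
    meta = {"sensor": "KOMPSAT-3A", "time": "2021-05-14", "coordinates": "(37.5N, 127.0E)"}
    objs = [{"category": "ship", "bbox": [120, 44, 180, 96]},
            {"category": "ship", "bbox": [300, 210, 352, 260]}]
    for pair in build_qa_pairs(meta, objs):
        print(pair)
```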
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to effectively mitigate these interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce \textit{Multi-Scale Explanation Aggregation} (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose \textit{Activation Ranking Correlation} (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-$k$ prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
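The following sketch illustrates the multi-scale aggregation idea behind MSEA: the same attribution method is run on several rescaled copies of the image, and the resulting maps are resized back and averaged into a smoother explanation. The attribution callable and scales here are placeholders; the paper's actual MSEA and ARC procedures are more involved.

```python
# Aggregate saliency maps computed at multiple input scales into one explanation.
import torch
import torch.nn.functional as F


def multiscale_attribution(image, attribute, scales=(0.75, 1.0, 1.25)):
    # image: (3, H, W); attribute: callable (3, h, w) -> (h, w) saliency map.
    _, h, w = image.shape
    maps = []
    for s in scales:
        scaled = F.interpolate(image.unsqueeze(0), scale_factor=s,
                               mode="bilinear", align_corners=False).squeeze(0)
        m = attribute(scaled).unsqueeze(0).unsqueeze(0)
        maps.append(F.interpolate(m, size=(h, w), mode="bilinear",
                                  align_corners=False).squeeze())
    return torch.stack(maps).mean(dim=0)  # aggregated (H, W) explanation


if __name__ == "__main__":
    fake_saliency = lambda img: img.abs().mean(dim=0)  # toy stand-in attribution
    out = multiscale_attribution(torch.randn(3, 224, 224), fake_saliency)
    print(out.shape)  # torch.Size([224, 224])
```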
The Visual Question Answering (VQA) task requires not only accurate answers but also interpretable reasoning processes, particularly in real-world applications where transparency is critical. To reduce annotation and computational costs while maintaining interpretability, the Few-shot Multimodal Explainable VQA (FS-MEVQA) task has been introduced, which aims to generate explanations with limited supervision. In this work, we propose OPeMer (One-shot Prompting and Execution-driven Multimodal Explainable Reasoning), a code-based framework that leverages large language models (LLMs) to generate executable Python programs for multimodal reasoning in a oneshot setting. These programs interact with a lightweight Python API to process visual inputs, capture intermediate reasoning artifacts—such as object crops and spatial relations—and optionally call external tools for open-world visual understanding. The resulting execution traces are serialized and provided to the LLM via a secondary prompt, enabling the generation of coherent multimodal explanations grounded in both visual and textual evidence. Designed without reliance on handcrafted rules or large-scale supervision, OPeMer offers an efficient and extensible approach to explainable multimodal reasoning. Experimental results on the SME dataset demonstrate that OPeMer achieves strong answer accuracy and explanation quality, even when using cost-effective LLMs under limited supervision, suggesting its potential for scalable and interpretable VQA.
Semantic segmentation of satellite imagery, particularly Sentinel-2 data, is crucial for environmental monitoring and land cover mapping. This paper presents an unsupervised method for land cover classification that eliminates the need for pixel-level annotations. The approach combines clustering techniques (K-Means, DBSCAN, autoencoders) with automated cluster labeling using large vision-language models (e.g., GPT-4, Claude, Gemini 2.0). Clusters are visualized and interpreted by these models based on spatial context and color. The methodology achieves segmentation accuracy of 85–90%, comparable to supervised methods, while ensuring interpretability and scalability. A majority voting mechanism and terminology normalization improve consistency across model outputs. Validation is performed using ESA WorldCover maps. The proposed approach is promising for rapid land cover mapping in resource-constrained or emergency situations.
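A minimal sketch of the unsupervised pipeline outlined above: pixels are clustered with K-Means and each cluster is labeled by majority vote over answers from several vision-language models, with simple terminology normalization. The VLM callables are hypothetical stand-ins for calls to GPT-4 / Claude / Gemini with a cluster visualization; no real API is shown here.

```python
# K-Means pixel clustering plus majority-vote cluster labeling with term normalization.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def cluster_pixels(image: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    # image: (H, W, bands) reflectance array -> per-pixel cluster ids (H, W).
    h, w, bands = image.shape
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
        image.reshape(-1, bands)
    )
    return labels.reshape(h, w)


def normalize_term(term: str) -> str:
    synonyms = {"woodland": "forest", "trees": "forest", "crops": "cropland"}
    term = term.strip().lower()
    return synonyms.get(term, term)


def label_cluster_by_vote(cluster_id: int, ask_vlm_fns) -> str:
    # Query each model, normalize terminology, and take the majority label.
    answers = [normalize_term(ask(cluster_id)) for ask in ask_vlm_fns]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    fake_scene = np.random.rand(64, 64, 10)        # stand-in for a Sentinel-2 tile
    cluster_map = cluster_pixels(fake_scene)
    fake_models = [lambda _: "Woodland", lambda _: "forest", lambda _: "cropland"]
    print(label_cluster_by_vote(0, fake_models))   # -> "forest"
```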
Traditional anchor graph clustering (AGC) methods usually perform suboptimally when dealing with subspace similarity caused by spectral mixing and typically lack physical interpretability during anchor selection, relying on complex postprocessing steps. To address this issue, we introduce hyperspectral (HS) unmixing (HU) into the AGC framework, showing their inherent equivalence. Specifically, we proposed the subpixel AGC (SAGC) method, which explicitly models subpixel information from mixed spectra to endmembers for clustering. It allows seamless migration of HU methods to AGC tasks for determining the number of anchors and guiding anchor selection. To enhance anchor diversity while preserving subpixel information, we design the maximum anchor diversity selection strategy (MADSS). Finally, we apply this framework to HS–light detection and ranging (LiDAR) AGC, providing an implicit spatial regularization method based on unified abundances and an efficient solving strategy. To improve robustness in complex scenarios, we extend the method to its deep counterpart [deep SAGC (DSAGC)], effectively modeling spatial information and nonlinear features. The experimental results show that SAGC achieves performance comparable to state-of-the-art (SOTA) methods on three HS–LiDAR datasets, while DSAGC delivers significant improvements. Code is available at: https://github.com/Liujehong/SAGC
Many neurology-related vision disorders, such as cortical visual impairment (CVI), hemianopia, and visual agnosia, pose diagnostic challenges because of their varied manifestations across the structural, functional, and vascular domains of the brain. Conventional clinical practice relies solely on single imaging modalities, which often fail to capture the complex connections between these dimensions. This paper therefore proposes a pattern-driven AI framework that synthesizes four main brain imaging modalities (MRI, fMRI, DTI, and MRA) into a unified diagnostic pipeline. The proposed system addresses the key barriers to effective integration, including modality heterogeneity, spatial and temporal misalignment, missing modalities, and the complexity of fusion strategy design. Through modality-specific pre-processing, normalized feature representation, and attention-based fusion, the framework captures clinically relevant patterns while maintaining interpretability. Evaluated on benchmark and simulated datasets, the system shows improved diagnostic accuracy and robustness, offering a promising direction for early vision defect detection.
Precipitable water vapor (PWV) is a crucial atmospheric variable that influences weather systems, climate variability, and hydrological processes. Accurate PWV estimation is essential for improving numerical weather prediction, climate modeling, and remote-sensing applications. However, existing methods often rely on extensive meteorological inputs or computationally intensive architectures, limiting their applicability in data-sparse regions. This study introduces a novel hybrid framework, EMMA–NN–BiGRU–XGBoost, designed to forecast monthly mean PWV across Turkey using only four physically meaningful inputs: latitude, longitude, altitude, and seasonal indicators. The framework integrates an enhanced multimodal attention (EMMA) mechanism that disentangles spatial, altitudinal, and seasonal influences, improving interpretability and physical consistency. Bidirectional gated recurrent units (BiGRU) capture temporal dependencies, and XGBoost models nonlinear feature interactions within a weighted stacking ensemble. Hyperparameters are optimized via particle swarm optimization and Bayesian optimization, with particle swarm optimization demonstrating superior tuning efficiency. Extensive benchmarking against traditional machine-learning models, using grid search and random search with fivefold cross-validation, as well as deep-learning baselines, demonstrates significant improvements in predictive accuracy, achieving an R² of 0.92 and a 15%–20% reduction in error compared with state-of-the-art methods. The model also exhibits robustness across diverse climatic zones in Turkey. Shapley additive explanations further elucidate feature importance, aligning model outputs with climatological principles. Beyond methodological advances, this work provides a scalable, interpretable, and data-efficient baseline for PWV forecasting, thereby facilitating enhanced climate diagnostics, hydrological risk assessments, and early warning systems, particularly in regions with limited meteorological observations.
Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
Real-time emotion understanding in intelligent learning environments is an increasingly important requirement for classrooms that integrate visual, audio, physiological, and ambient IoT data. However, existing unimodal or low-context affect models struggle with temporal instability, low interpretability, and inefficient deployment. This work addresses robust multimodal emotion prediction across seven affective states from heterogeneous signals under real-time constraints, aiming to deliver accurate, low-latency, and explainable predictions that can run on edge devices. The study proposes SwinET-IoT, a multimodal emotion recognition framework that combines an extended Mask R-CNN with a Swin Transformer backbone, lightweight audio and physiological encoders, IoT telemetry processing, a temporal attention transformer, and cross-modal co-attention. Mask-guided spatial attention ensures that information from critical facial and posture regions reaches the fusion stage, while hierarchical multimodal fusion and temporal-consistency losses increase robustness. Edge-oriented optimization via pruning, distillation, and 8-bit quantization reduces the model footprint with minimal loss of accuracy. Experiments on the CRAFT multimodal classroom dataset show clear improvements over five state-of-the-art baselines (92% accuracy, 0.91 macro-F1, 0.96 AUC, and 42 ms/frame inference latency on an edge device). These results confirm that combining spatially guided multimodal fusion with temporal and IoT-context modeling yields efficient, stable, and interpretable emotion prediction suitable for real-world classroom analytics and adaptive learning systems.
Early diagnosis of temporomandibular disorders is challenging. Particularly, intra-articular temporomandibular joint (TMJ) abnormalities can only be confirmed using magnetic resonance imaging (MRI). This study aimed to develop a comprehensive screening method for MRI-detectable TMJ pathologies. We developed an interpretable deep learning framework that leveraged paired open- and closed-mouth TMJ panoramic radiographs and structured clinical metadata. The architecture integrated anatomically guided attention, multimodal clinical features, and ensemble learning for enhanced diagnostic accuracy and interpretability. Across 1355 patients (2710 joints), the best-performing ensemble framework achieved an area under the curve of 0.86, with a balanced classification of MRI-negative and -positive cases. Gradient-weighted Class Activation Mapping visualizations confirmed a consistent focus on the condylar regions, and ablation studies demonstrated the added value of clinical metadata and spatial attention. In conclusion, our prototype workflow can be useful to triage TMJ patients for MRI referral, thus supporting early detection of TMJ abnormalities and timely interventions.
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce \textbf{Mem4Nav}, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and>10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open-sourced via https://github.com/tsinghua-fib-lab/Mem4Nav.
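As a toy illustration of the fine-grained spatial indexing that such a hierarchical memory relies on, the sketch below quantizes positions into voxel keys and stores/retrieves observations by location. The voxel size and the flat dictionary are illustrative simplifications of Mem4Nav's sparse octree and memory tokens, not its actual data structure.

```python
# Voxel-keyed memory: positions map to integer voxel coordinates used as lookup keys.
from collections import defaultdict


def voxel_key(x: float, y: float, z: float, voxel_size: float = 0.5) -> tuple:
    return (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))


class VoxelMemory:
    def __init__(self, voxel_size: float = 0.5):
        self.voxel_size = voxel_size
        self.store = defaultdict(list)  # voxel key -> list of observation entries

    def write(self, position, observation):
        self.store[voxel_key(*position, self.voxel_size)].append(observation)

    def read(self, position):
        return self.store.get(voxel_key(*position, self.voxel_size), [])


if __name__ == "__main__":
    mem = VoxelMemory()
    mem.write((1.2, 0.3, 4.9), "red storefront on the left")
    print(mem.read((1.4, 0.4, 4.8)))  # same voxel -> ['red storefront on the left']
```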
Cross-modal retrieval is a promising technique nowadays to find semantically similar instances in other modalities while a query instance is given from one modality. However, there still exists many challenges for reducing heterogeneous modality gap by embedding label information to discrete hash codes effectively, solving the binary optimization when generating unified hash codes and reducing the discrepancy of data distribution efficiently during common space learning. In order to overcome the above-mentioned challenges, we propose a Collaboratively Semantic alignment and Metric learning for cross-modal Hashing (CSMH) in this paper. Specifically, by a kernelization operation, CSMH first extracts the non-linear data features for each modality, which are projected into a latent subspace to align both marginal and conditional distributions simultaneously. Then, a maximum mean discrepancy-based metric strategy is customized to mitigate the distribution discrepancies among features from different modalities. Finally, semantic information obtained from the label similarity matrix, is further incorporated to embed the latent semantic structure into the discriminant subspace. Experimental results of CSMH and baseline methods on four widely-used datasets show that CSMH outperforms some state-of-the-art hashing baseline methods for cross-modal retrieval on efficiency and precision.
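To make the maximum-mean-discrepancy component concrete, here is a compact sketch of an RBF-kernel MMD between image and text feature batches, which could serve as a loss term pulling the two modality distributions together in a shared subspace. The kernel bandwidth and feature sizes are illustrative assumptions, and this is not CSMH's full objective.

```python
# RBF-kernel maximum mean discrepancy between two feature batches.
import torch


def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    dists = torch.cdist(x, y) ** 2
    return torch.exp(-dists / (2 * sigma ** 2))


def mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Simple (biased) estimator: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    return (rbf_kernel(x, x, sigma).mean() + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())


if __name__ == "__main__":
    img_feats = torch.randn(128, 64)
    txt_feats = torch.randn(128, 64) + 0.5   # deliberately shifted distribution
    print("MMD:", mmd(img_feats, txt_feats).item())
```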
Cross-modal retrieval, as an emerging field within multimedia research, has gained significant attention in recent years. Unsupervised cross-modal hashing methods are attractive due to their ability to capture latent relationships within the data without label supervision and to produce compact hash codes for high search efficiency. However, the text modality exhibits worse representation ability compared with the image modality, leading to weak guidance to construct the joint similarity matrix. Moreover, most unsupervised cross-modal hashing methods are based on pairwise similarities for training, resulting in non-aggregating data distribution in the hash space. In this paper, we propose a novel Vision-guided Text Mining for Unsupervised Cross-modal Hashing via Community Similarity Quantization, termed VTM-UCH. Specifically, we first find the one-to-one correspondence between each word and each vision (image or object) based on the Contrastive Language-Image Pre-training (CLIP) model and compute the text similarities according to the clustering of their corresponding visions. Then, we define the fine-grained object-level image similarities and design the joint similarity matrix based on the text and image similarities. Accordingly, we construct an undirected graph to compute the communities as the pseudo-centers and adjust the pairwise similarities to improve the hash codes distribution. The experimental results on two common datasets verify the accuracy improvements in comparison with state-of-the-art baselines.
Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples, enabling the model to handle unseen classes during inference. Unlike traditional cross-modal retrieval tasks, which assume that both training and testing data share the same class distribution, few-shot retrieval involves data with sparse representations across modalities. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data, resulting in two main biases in the latent semantic space: intra-modal bias, where sparse samples fail to capture intra-class diversity, and inter-modal bias, where misalignments between image and text distributions exacerbate the semantic gap. These biases hinder retrieval accuracy. To address these issues, we propose a novel method, GCRDP, for few-shot cross-modal retrieval. This approach effectively captures the complex multi-peak distribution of data using a Gaussian Mixture Model (GMM) and incorporates a multi-positive sample contrastive learning mechanism for comprehensive feature modeling. Additionally, we introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions, thereby improving the accuracy of cross-modal representations. We validate our approach through extensive experiments on four benchmark datasets, demonstrating superior performance over six state-of-the-art methods.
Cross-modal retrieval aims to bridge the semantic gap between heterogeneous modalities—such as images and text—by learning a shared embedding space for semantically aligned representation. While recent models have achieved impressive performance using large-scale contrastive pretraining and multimodal transformers, several fundamental challenges remain unresolved. These include the lack of interpretable latent alignment, vulnerability to distribution shifts, and instability in semantic correspondence across tasks and domains. In this paper, we propose a novel contrastive representation learning framework designed to enhance both the robustness and interpretability of cross-modal retrieval. Our method incorporates a hierarchical dual-stream encoder that preserves modality-specific structures while enabling semantic interaction through a conceptaligned projection layer. The model is optimized via a contrastive loss with semanticaware calibration, encouraging consistent feature correspondence across modalities. We provide a rigorous theoretical analysis of the latent projection space, and demonstrate through extensive experiments on MS-COCO, Flickr30K, and RSICD that our approach outperforms strong baselines not only in retrieval accuracy but also in robustness under noise and interpretability via semantic stability selection. The proposed framework is further validated through ablation studies that isolate the contributions of architectural components and training strategies. Our results confirm that semantic disentanglement and hierarchical encoding jointly improve retrieval quality, cross-domain generalization, and feature transparency. The framework offers a scalable and theoretically grounded solution for reliable and explainable multimodal retrieval.
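The symmetric contrastive objective underlying frameworks like the one above can be sketched as follows: image and text embeddings are L2-normalized and matched pairs are pulled together with a temperature-scaled cross-entropy in both retrieval directions. The temperature and dimensions are illustrative assumptions; the paper's semantic-aware calibration is not reproduced here.

```python
# Symmetric (image-to-text and text-to-image) contrastive loss over a batch of pairs.
import torch
import torch.nn.functional as F


def symmetric_contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # img, txt: (batch, dim); row i of img corresponds to row i of txt.
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    loss = symmetric_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
    print(loss.item())
```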
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and evaluation raw data will be publicly available.
In this paper, we address the novel task of egocentric modality generalization action recognition, which aims to learn a unified discrete representation from paired multimodal egocentric action data during pre-training. This approach enables cross-modal zero-shot generalization in downstream tasks, where the modalities available during inference and training are disjoint. While recent efforts have focused on aligning instance-level or temporal features to reduce feature distribution discrepancies across modalities, they have overlooked the inherent structural categorization within action data. To address this limitation, we propose Modal-Agnostic Prototype Learning (MAPLE), a framework that leverages a prototype memory bank to capture categorical structures. This is further enhanced by a robust semantic disentanglement module and a moment aggregation mechanism, enabling semantically similar behaviors to cluster more closely in the latent space and promoting robust cross-modal generalization. Extensive experiments on the Ego4D and WEAR datasets demonstrate that MAPLE significantly outperforms existing approaches, marking a substantial advancement in the field of egocentric action recognition.
This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration in Vision-Language Models (VLMs) when encountering Out-of-Distribution (OOD) concepts. Specifically, four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities via a structured message-propagation protocol. The principal contributions encompass a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm for enhanced few-shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.
Abstract. Text-to-image person re-identification (TIReID) is a significant challenge in the cross-modal community, focused on retrieving individuals based on textual queries. The primary obstacle is effectively mapping visual and textual modalities into a shared latent space, a problem inadequately addressed by previous methods that relied on separately pre-trained unimodal models, which often lack the necessary alignment capabilities. Recently, contrastive language-image pre-training (CLIP) has emerged as a versatile large-scale cross-modal visual-language pre-training model, excelling in various cross-modal downstream tasks due to its powerful semantic learning capabilities. CLIP has successfully addressed the need for manual alignment of body part features required by earlier methods. However, in TIReID, integrating CLIP with ReID presents notable alignment issues. It struggles with filtering task-specific irrelevant information, leading to redundancy and interference. In addition, CLIP lacks effective internal alignment, such as misaligned body parts and semantic misalignment. Finally, the joint loss function integrates identification loss, image-text contrastive loss, and mask-based unsupervised training to enhance feature alignment. This new loss structure effectively reduces the risk of over-alignment, ensuring a more balanced training process. Experimental results on CUHK-PEDES benchmarks demonstrate cross-modal latent feature alignment’s effectiveness, surpassing state-of-the-art methods with improvements in rank 1 accuracy by 1.54%.
Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is available at https://github.com/adabrh/MAFR.
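A small sketch of the reconstruction-error localization step described above: each location is scored by the distance between input features and their restored counterparts, fused across an RGB and a point-cloud feature map. Shapes and the fusion weight are illustrative assumptions rather than MAFR's exact scoring.

```python
# Per-location reconstruction error fused across two modalities into one anomaly map.
import torch


def anomaly_map(rgb_feat, rgb_recon, pc_feat, pc_recon, alpha: float = 0.5):
    # Each tensor: (channels, H, W). Per-pixel L2 error, fused across modalities.
    rgb_err = (rgb_feat - rgb_recon).pow(2).sum(dim=0).sqrt()
    pc_err = (pc_feat - pc_recon).pow(2).sum(dim=0).sqrt()
    return alpha * rgb_err + (1 - alpha) * pc_err  # (H, W) anomaly score map


if __name__ == "__main__":
    c, h, w = 64, 28, 28
    amap = anomaly_map(torch.randn(c, h, w), torch.randn(c, h, w),
                       torch.randn(c, h, w), torch.randn(c, h, w))
    print(amap.shape, amap.max().item())
```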
Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. This gap arises for multiple reasons, e.g., sampling bias and noise. In the era of foundation models, we show that when leveraging off-the-shelf (vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique for acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, with the aim of bridging the gap between local and global observations. In long-tailed learning, it utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that the proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.
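For the long-tailed setting, the core idea of transferring the geometric shape of sample-rich classes to sample-scarce ones can be illustrated with a small NumPy sketch: keep the tail-class mean, borrow the covariance from the nearest head classes, and draw synthetic features from the calibrated Gaussian. The function name, the top-k neighbor rule, and the shrinkage term are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def calibrate_tail_class(tail_feats, head_feats_by_class, top_k=2, alpha=0.5, n_samples=100):
    """Minimal sketch of geometry-guided calibration for a sample-scarce class.

    tail_feats: (n, D) features of the tail class; head_feats_by_class: dict mapping
    a head-class id to its (N_c, D) feature matrix. All names/parameters are assumptions.
    """
    tail_mean = tail_feats.mean(axis=0)

    # Rank head classes by similarity of their means to the tail mean
    head_means = {c: f.mean(axis=0) for c, f in head_feats_by_class.items()}
    ranked = sorted(head_means, key=lambda c: np.linalg.norm(head_means[c] - tail_mean))

    # Borrow the "geometric shape": average covariance of the top-k nearest head classes,
    # with a small diagonal shrinkage term for numerical stability
    covs = [np.cov(head_feats_by_class[c], rowvar=False) for c in ranked[:top_k]]
    calibrated_cov = np.mean(covs, axis=0) + alpha * np.eye(tail_mean.shape[0])

    # Draw synthetic tail-class features from the calibrated Gaussian
    return np.random.multivariate_normal(tail_mean, calibrated_cov, size=n_samples)
```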
This paper introduces an optical-to-radar cross-modal framework, denominated AORC, for generating full-attitude, high-fidelity Inverse Synthetic Aperture Radar (ISAR) samples of space targets; the generated ISAR images exhibit strong physical fidelity and satisfactory textural representation of target components, even when the input optical samples are sparse. Specifically, the attitude encoding module (AEM) assimilates prior knowledge of analogous targets across different attitudes through a carefully designed NeRF-based encoder, deriving encoded features in the latent space. These comprehensive attitude features are then fed into the modality transformation module (MTM), which applies a Brownian-Bridge-based diffusion process to transform each attitude-specific feature from the optical to the ISAR modality. Extensive simulations on satellite targets validate the effectiveness of the proposed approach.
Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.
The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which an attacker embeds adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agents' decision-making process and execute unauthorized tasks. Our approach incorporates two coordinated components. First, we introduce Visual Latent Alignment, which optimizes adversarial features toward the malicious instructions in the visual embedding space using a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Second, we present Textual Guidance Enhancement, in which a large language model is leveraged to construct the black-box defensive system prompt through adversarial meta-prompting and to generate a malicious textual command that steers the agents' output toward better compliance with the attacker's requests. Extensive experiments demonstrate that our method outperforms state-of-the-art attacks, achieving at least a +30.1% increase in attack success rates across diverse tasks. Furthermore, we validate the attack's effectiveness on real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications. Code can be found at https://github.com/Larry0454/CrossInject.
The lack of suitable evaluation metrics hinders the precise measurement of biases in the cross-modal feature space and the distinctiveness of 3D point cloud features, impeding further optimization efforts for enhanced 3D understanding. To tackle these challenges, we present a unified distribution-similarity-coefficient-driven multimodal pre-training framework for 3D understanding, termed CAUD-3D, providing a deeper understanding of the cross-modal alignment and uni-modal disentanglement processes during multimodal pre-training. Specifically, we generalize class-wise features to a Gaussian distribution, facilitating the quantification of representation quality within the hyper-sphere space through the calculation of the distribution similarity coefficient. To the best of our knowledge, this is the first work to measure the representation quality of cross-modal features from the perspective of the distribution similarity coefficient. Furthermore, we formulate cross-modal class-wise alignment and uni-modal class-wise discrepancy loss terms to align cross-modal class-wise feature distributions and disentangle interference among the 3D class-wise feature distributions. Our method significantly outperforms previous works.
In recent years, visual question answering (VQA) has become widely used in multimodal domains, yet the multimodal semantic gap and data distribution bias between modalities reduce model generalization, limiting improvements in comprehension and reasoning performance. To address these issues, we propose the Contrastive Clustering Algorithm (CCA), a coordinated framework that integrates contrastive learning with a clustering algorithm. CCA utilizes contrastive loss functions to construct cross-modal positive and negative sample pairs, enabling effective mining and alignment of semantic information between different modalities. It is combined with a clustering algorithm to partition the feature space at a fine-grained level, reducing intra-cluster differences and increasing inter-cluster separation for more discriminative feature representations. Extensive experiments on the VQA v2 dataset show that CCA significantly enhances cross-modal comprehension and reasoning ability, providing an effective approach and new strategies for mitigating semantic and distributional bias in VQA.
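A hedged sketch of the two ingredients named in this abstract (cross-modal contrastive pairs plus a clustering term that tightens intra-cluster distances) is given below; the joint-feature construction, weighting, and centroid handling are illustrative assumptions rather than CCA's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_clustering_loss(vis_feat, txt_feat, cluster_centers, temperature=0.1):
    """Sketch combining a cross-modal contrastive term with a clustering term.

    vis_feat, txt_feat: (B, D) paired visual/question embeddings.
    cluster_centers: (K, D) current cluster centroids (e.g., from k-means on the
    joint features). Names and the unit weighting are assumptions.
    """
    v = F.normalize(vis_feat, dim=-1)
    t = F.normalize(txt_feat, dim=-1)

    # Cross-modal contrastive term: matched (v_i, t_i) pairs are positives,
    # all other pairings in the batch are negatives.
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    contrastive = F.cross_entropy(logits, targets)

    # Clustering term: pull each joint feature toward its nearest centroid,
    # reducing intra-cluster differences.
    joint = F.normalize(v + t, dim=-1)
    centers = F.normalize(cluster_centers, dim=-1)
    nearest = (joint @ centers.t()).argmax(dim=1)                 # (B,)
    compactness = (joint - centers[nearest]).pow(2).sum(dim=1).mean()

    return contrastive + compactness
```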
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges for effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For handling heterogeneity, we employ a prototype-guided optimal transport alignment strategy leveraging Gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by matching latent distributions with Maximum Mean Discrepancy (MMD) regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, further reducing cross-modal inconsistencies. Our extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in achieving superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning. Our project page is at https://taco-group.github.io/DecAlign.
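The MMD regularizer used for the homogeneous (modality-common) features can be written compactly with an RBF kernel. The sketch below uses a single fixed bandwidth and the biased estimator, both simplifying assumptions.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """RBF-kernel Maximum Mean Discrepancy between two feature sets (biased estimator).

    x: (N, D) modality-common features from one modality; y: (M, D) from another.
    A minimal sketch of the MMD regularizer described above; the single fixed
    bandwidth is a simplifying assumption.
    """
    def rbf(a, b):
        d2 = torch.cdist(a, b).pow(2)            # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))

    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()
```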
We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model's last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines and improving BLEU scores by up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at https://github.com/Hanhpt23/OmniMod.
Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion
Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pretrained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pretrained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency, achieving an up to 9% recall improvement at 80% precision on proprietary datasets. Additionally, we introduce Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models, yielding up to another 9% recall improvement at 80% precision. Our methods are successfully deployed in multiple production systems, leading to significant business gains through online A/B experiments.
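The abstract does not spell out the LSB criterion, so the sketch below shows one plausible kNN-in-latent-space scoring rule for active learning: prioritize unlabeled items whose nearest labeled neighbors disagree with the model's prediction, which can surface overconfident misclassifications. All names and the scoring rule itself are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lsb_candidate_scores(unlabeled_emb, unlabeled_pred, labeled_emb, labeled_y, k=10):
    """Hedged sketch of a kNN-style latent-space criterion for active learning.

    unlabeled_emb: (U, D) embeddings of unlabeled items; unlabeled_pred: (U,)
    predicted classes; labeled_emb: (L, D); labeled_y: (L,) ground-truth classes.
    The exact scoring rule in the paper is not public here; this is illustrative.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(labeled_emb)
    _, idx = nn.kneighbors(unlabeled_emb)                 # (U, k) neighbor indices
    neighbor_labels = labeled_y[idx]                      # (U, k)

    # Fraction of labeled neighbors whose class differs from the model's prediction;
    # higher scores suggest items worth sending to annotators.
    disagreement = (neighbor_labels != unlabeled_pred[:, None]).mean(axis=1)
    return disagreement
```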
The field of cross-modal retrieval aims to construct a shared representation space for samples from multiple modalities, typically within the vision and language domains. Deep hashing, with its high computational efficiency and low storage costs, has emerged as a central focus in this field and has garnered significant attention in recent research. However, current hash retrieval, which concentrates on deterministic methods, struggles to effectively capture semantically ambiguous correspondences between cross-modal samples, where heterogeneous data exhibit complex many-to-many semantic relationships in the latent space. To address this limitation, we propose a novel Deep Probabilistic Binary Embedding (DPBE) framework, designed to generate discriminative, modality-invariant hash codes that facilitate accurate and reliable cross-modal retrieval. In contrast to contemporary probabilistic methods, we focus on optimizing hash networks to learn more accurate binary embeddings by using the learning mode of probabilistic embeddings. We introduce the first Bayesian encoder for hash learning, which employs Laplace Approximation to model a distribution over network weights. Extensive experimental results demonstrate that our approach not only outperforms deterministic methods in retrieval performance but also provides uncertainty estimates, enhancing the interpretability of the embeddings. The corresponding code is available at https://github.com/QinLab-WFU/DPBE.
Cross-modal retrieval requires building a common latent space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique to address this problem. This paper shows that such an approach is not always effective in handling intra-modal similarities. Specifically, we found that this method can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels. We present comprehensive experiments on MS-COCO and Flickr30k, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. We also conduct a case study on the ROCO dataset to assess the performance of our method on medical images and present an ablation study on one of our approaches to understand the impact of the different components of the proposed loss function. Our code is publicly available on GitHub https://github.com/MariodotR/FullHN.git.
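The sketch below illustrates hard-negative triplet mining over a batch together with one plausible intra-modal ordering constraint of the kind argued for above; the specific intra-modal term is an assumption, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def triplet_with_intra_modal_constraint(img, txt, margin=0.2, intra_margin=0.1):
    """Sketch of hard-negative triplet loss plus an intra-modal ordering constraint.

    img, txt: (B, D) embeddings of matched image-text pairs (shapes are assumptions).
    """
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    sim = img @ txt.t()                                   # cross-modal similarities
    pos = sim.diag()                                      # matched pairs

    # Hard negative mining: hardest non-matching text per image and vice versa
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_txt = sim.masked_fill(mask, -1).max(dim=1).values
    hardest_img = sim.masked_fill(mask, -1).max(dim=0).values
    triplet = (F.relu(margin + hardest_txt - pos) +
               F.relu(margin + hardest_img - pos)).mean()

    # Intra-modal constraint (illustrative): similarity of an image to any other
    # image should not exceed its similarity to its own caption by a margin.
    intra_img = (img @ img.t()).masked_fill(mask, -1).max(dim=1).values
    intra = F.relu(intra_margin + intra_img - pos).mean()

    return triplet + intra
```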
Hybrid motor imagery brain-computer interfaces (MI-BCIs), which integrate both electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) signals, outperform those based solely on EEG. However, simultaneously recording EEG and fNIRS signals is highly challenging due to the difficulty of colocating both types of sensors on the same scalp surface. This physical constraint complicates the acquisition of high-quality hybrid signals, thereby limiting the widespread application of hybrid MI-BCIs. To address this issue, this study proposes the spatio-temporal controlled diffusion model (SCDM) as a framework for cross-modal generation from EEG to fNIRS. The model utilizes two core modules, the spatial cross-modal generation (SCG) module and the multi-scale temporal representation (MTR) module, which adaptively learn the respective latent temporal and spatial representations of both signals in a unified representation space. The SCG module further maps EEG representations to fNIRS representations by leveraging their spatial relationships. Experimental results show high similarity between synthetic and real fNIRS signals. The joint classification performance of EEG and synthetic fNIRS signals is comparable to or even better than that of EEG with real fNIRS signals. Furthermore, the synthetic signals exhibit similar spatio-temporal features to real signals while preserving spatial relationships with EEG signals. To our knowledge, this is the first work to propose an end-to-end framework for cross-modal generation from EEG to fNIRS. Experimental results suggest that the SCDM may represent a promising paradigm for the acquisition of hybrid EEG-fNIRS signals in MI-BCI systems.
Current research on adversarial attacks mainly focuses on RGB trackers, with no existing methods for attacking RGB-T cross-modal trackers. To fill this gap and overcome its challenges, we propose a progressive adversarial patch generation framework and achieve cross-modal stealth. On the one hand, we design a coarse-to-fine architecture grounded in the latent space to progressively and precisely uncover the vulnerabilities of RGB-T trackers. On the other hand, we introduce a correlation-breaking loss that disrupts the modal coupling within trackers, spanning from the pixel to the semantic level. These two design elements ensure that the proposed method can overcome the obstacles posed by cross-modal information complementarity in implementing attacks. Furthermore, to enhance the reliable application of the adversarial patches in real world, we develop a point tracking-based reprojection strategy that effectively mitigates performance degradation caused by multi-angle distortion during imaging. Extensive experiments demonstrate the superiority of our method.
With 5G and IoT booming, explosive multimodal data growth challenges communication bandwidth and retrieval accuracy. Cross-modal hashing stands out in cross-modal retrieval tasks owing to the low storage requirements and swift retrieval advantages of binary encoding. However, existing methods often rely on coarse-grained semantics and single supervision information, ignoring the impact of fine-grained semantics and joint supervision. To handle these dilemmas, this paper proposes Multi-Semantic Embedding Hashing (MSEH) for large-scale cross-modal retrieval. First, modality-specific representations are learned to explore modality-private semantic information. Then, fine-grained semantic information is mined through multimodal latent space learning and semantic center learning. Finally, multiple semantics are embedded, and hash code learning is jointly supervised. Extensive empirical tests on three classic datasets demonstrate that MSEH outperforms six state-of-the-art methods.
Deep learning (DL)-based pansharpening has been widely applied in high-resolution imaging. Yet, artifacts related to generalization and oversmoothing have remained a persistent challenge, primarily due to the mismatch between the simulation dataset and unseen real-world scenarios. Current approaches address these through unsupervised frameworks or generative models, while modal inconsistency is not fully considered, leading to suboptimal performance. In this article, we propose a contrastive cross-modal framework via uncertainty guidance (UGCC), which comprises three key modules: a contrast feature enhancement module (CFEM), a cross-modal compensation module (CMCM), and an uncertainty guidance module (UGM). First, to enhance generalization and reduce overfitting, CFEM is introduced. Robust contrast features are augmented and learned sparsely in the latent space, where sample distributions are refined and redundant information is filtered from highly similar sample pairs for enhanced training stability. Furthermore, CMCM mitigates modal inconsistency effectively through domain transfer and collaborative attention, achieving efficient modal separation and interaction. Finally, to adaptively balance the performance of CMCM and CFEM based on prediction confidence, a hybrid loss function is designed, where UGM adjusts the weights by quantifying statistical-versus-structural uncertainties. Extensive experiments on QuickBird, Gaofen-2, WorldView-2, and WorldView-3 demonstrate that the performance of the proposed method surpasses or matches the state of the art. Furthermore, ablation studies validate the effectiveness of each component. The code is now available at: https://github.com/meimeizeng/UGCF.
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags -- automatically extracted from foundation models -- to enhance video retrieval. We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts derived from the features of a video and its corresponding caption. We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts, allowing concepts to be distinguished from one another. We conduct extensive experiments on six diverse datasets: two different splits of MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across others. Project Webpage: https://adrianofragomeni.github.io/MAC-VR/
Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly “brushes” user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and align results with the user’s intent through textual prompts. In this work, we propose In-Context Brush, a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject aligning textual prompts without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head latent feature shifting within each attention head that dynamically shifts attention outputs to reflect the desired subject semantics and inter-head attention reweighting across different heads that amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection. Project page: https://yuci-gpt.github.io/In-Context-Brush/.
Climate change is intensifying wildfire risks globally, making reliable forecasting critical for adaptation strategies. While machine learning shows promise for wildfire prediction from Earth observation data, current approaches lack uncertainty quantification essential for risk-aware decision making. We present the first systematic analysis of spatial uncertainty in wildfire spread forecasting using multimodal Earth observation inputs. We demonstrate that predictive uncertainty exhibits coherent spatial structure concentrated near fire perimeters. Our novel distance metric reveals high-uncertainty regions form consistent 20-60 meter buffer zones around predicted firelines - directly applicable for emergency planning. Feature attribution identifies vegetation health and fire activity as primary uncertainty drivers. This work enables more robust wildfire management systems supporting communities adapting to increasing fire risk under climate change.
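The distance-based summary described above can be approximated with a Euclidean distance transform around the predicted fire perimeter, binning per-pixel uncertainty by distance. Pixel size and bin edges in the sketch below are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def uncertainty_distance_profile(pred_fire_mask, uncertainty_map, pixel_size_m=10.0,
                                 bins_m=(0, 20, 40, 60, 100, 200)):
    """Hedged sketch: summarize predictive uncertainty as a function of distance
    to the predicted fireline, in the spirit of the distance metric above.

    pred_fire_mask: boolean (H, W) predicted burned area; uncertainty_map: (H, W)
    per-pixel uncertainty. Pixel size and bin edges are illustrative assumptions.
    """
    # Distance (in metres) of every pixel to the predicted fire perimeter
    dist_outside = distance_transform_edt(~pred_fire_mask) * pixel_size_m
    dist_inside = distance_transform_edt(pred_fire_mask) * pixel_size_m
    dist_to_perimeter = np.where(pred_fire_mask, dist_inside, dist_outside)

    # Mean uncertainty per distance band; a peak in the first bands corresponds
    # to a high-uncertainty buffer zone around the fireline.
    profile = []
    for lo, hi in zip(bins_m[:-1], bins_m[1:]):
        band = (dist_to_perimeter >= lo) & (dist_to_perimeter < hi)
        profile.append(uncertainty_map[band].mean() if band.any() else np.nan)
    return profile
```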
Explainable Artificial Intelligence (XAI) is now crucial for ensuring the safe use of deep learning models in hospitals and clinics. This holds especially true for medical imaging, where reliably informed diagnoses are difficult to make when the model's workings are unknown. Although existing interpretability methods, such as saliency maps, attribution techniques, and prototype-based reasoning, give some insight into how models work, they typically address only one type of input at a time. In real clinical work, though, doctors often deal with multimodal imaging such as MRI, CT, and retinal scans, each with its own resolution, contrast, and anatomical detail. This paper therefore presents MedXAI-MM, a unified framework for multi-modality explainability. The framework brings together three main parts: Grad-CAM++ for spatial saliency, DeepSHAP for pixel-level feature attribution, and Case-Based Retrieval (CBR), which adds clinical context through evidence from similar cases. The setup uses a hybrid backbone of ResNet-50 and Swin Transformer, allowing local and global features to be extracted in a complementary way, and attention-based fusion supports modality-aware representation learning. We also introduce a new metric, the Cross-Modality Fidelity Score (CMFS), which measures how consistent explanations remain across imaging types that differ substantially. Experiments on BraTS-MRI, ChestX-ray14, and DRIVE show strong results: MedXAI-MM improves faithfulness by up to 18% and localization IoU by 12%, and clinicians rate its interpretability higher than top baselines. Overall, the findings show how unified multimodal interpretability can bridge the gap between accurate diagnostics and transparency in medicine, pushing AI closer to everyday clinical use.
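The abstract does not define CMFS precisely, so the following is only an illustrative stand-in for a cross-modality explanation-consistency score: mean pairwise cosine similarity between normalized saliency maps of the same case across modalities, assuming the maps are registered to a common grid.

```python
import numpy as np

def explanation_consistency(saliency_maps):
    """Hedged sketch of a cross-modality explanation-consistency score.

    saliency_maps: list of (H, W) saliency/attribution maps for the same case
    obtained from different imaging modalities, already registered to a common
    grid. This is an illustrative stand-in for the CMFS described above, whose
    exact definition is not given in the abstract.
    """
    flats = []
    for m in saliency_maps:
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)   # min-max normalize to [0, 1]
        flats.append(m.ravel())

    sims = []
    for i in range(len(flats)):
        for j in range(i + 1, len(flats)):
            a, b = flats[i], flats[j]
            sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return float(np.mean(sims))
```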
Bipolar disorder (BD) is a debilitating mental illness characterized by significant mood swings, posing a substantial challenge for accurate diagnosis due to its clinical complexity. This paper presents CS2former, a novel approach leveraging a dual channel-spatial feature extraction module within a Transformer model to diagnose BD from resting-state functional MRI (Rs-fMRI) and T1-weighted MRI (T1w-MRI) data. CS2former employs a Channel-2D Spatial Feature Aggregation Module to decouple channel and spatial information from Rs-fMRI, while a Channel-3D Spatial Attention Module with Synchronized Attention Module (SAM) concurrently computes attention for T1w-MRI feature maps. This dual extraction strategy is coupled with a Transformer, enhancing feature integration across modalities. Our experimental results on two datasets, including the OpenfMRI and our collected datasets, demonstrate CS2former's superior performance. Notably, the model achieves a 10.8% higher Balanced Accuracy on our dataset and a 5.7% improvement on the OpenfMRI dataset compared to the baseline models. These results underscore CS2former's innovation in multimodal feature fusion and its potential to elevate the efficiency and accuracy of BD diagnosis.
Multimodal magnetic resonance imaging (MRI) is vital for the precise segmentation of brain tumors. However, missing or incomplete multimodal data is a frequent challenge in clinical practice, significantly affecting segmentation performance. Current advanced methods primarily focus on fusing multimodal images in the spatial domain, often neglecting the interplay between different modalities in the frequency domain. In this work, we propose a novel framework named spatial and frequency feature recalibration Transformer (SFFR-Transformer), which utilizes a frequency and spatial hybrid multihead attention (FSHMA) Transformer. This approach facilitates the complementary fusion of spatial- and frequency-domain information, enhancing the reconstruction of missing modalities. Moreover, most existing methods map fused modalities directly to all segmentation targets. Inspired by the correlation between single modalities and specific subtargets, we introduce a modality-subtarget matching module (MSTM). This module decouples the fusion modalities from the segmentation targets, enabling more accurate mapping between single modalities and their corresponding subtargets. Comprehensive experiments on the publicly available BraTS2018 and BraTS2020 datasets demonstrate that our framework surpasses state-of-the-art methods, particularly in scenarios involving missing modalities.
Enhancing Spatial Reasoning in Multimodal Vision-Language Models via Depth-Aware Feature Integration
Vision-language models such as Contrastive Language-Image Pre-training (CLIP) excel at aligning images and text, yet they still struggle to reason about spatial relationships within a scene. We present a lightweight extension to CLIP that injects a depth modality, fusing RGB and depth features while preserving the model's pre-trained visual-textual knowledge. Trained and evaluated on dedicated spatial-reasoning benchmarks, our depth-enhanced architecture yields substantial accuracy gains over an RGB-only baseline and produces more human-like judgments of relative position and distance. These findings chart a clear path from 2-D image-language models toward true 3-D spatial understanding. Crucially, the compact design retains the rich semantics learned during CLIP's large-scale pre-training, enabling markedly faster convergence while matching the final accuracy achieved by training from scratch.
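A minimal sketch of the depth-injection idea, assuming a frozen CLIP image embedding fused with a small depth encoder via concatenation and an MLP; the dimensions and fusion head are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DepthAwareCLIPHead(nn.Module):
    """Minimal sketch of fusing a frozen CLIP image embedding with a depth embedding.

    `clip_dim` / `depth_dim` and the concatenation-plus-MLP fusion are assumptions
    for illustration; the paper's exact fusion layer is not shown here.
    """
    def __init__(self, clip_dim=512, depth_dim=256, out_dim=512):
        super().__init__()
        self.depth_encoder = nn.Sequential(          # tiny CNN over a depth map
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, depth_dim),
        )
        self.fusion = nn.Sequential(
            nn.Linear(clip_dim + depth_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, clip_image_feat, depth_map):
        # clip_image_feat: (B, clip_dim) from the frozen CLIP visual tower
        # depth_map: (B, 1, H, W) metric or relative depth
        d = self.depth_encoder(depth_map)
        fused = self.fusion(torch.cat([clip_image_feat, d], dim=-1))
        # The fused embedding is compared against CLIP text embeddings as usual
        return fused
```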
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaved multimodal understanding. Our code, data, and benchmark will be released at https://github.com/appletea233/LLaVA-ST.
The rapid development of sensor and multimodal technology has provided more possibilities for multisource remote sensing image classification. However, some existing joint classification methods are limited to single-level feature fusion and fail to fully explore the deep correlation between cross-level features, thus limiting the effective interaction and complementarity of information between different modal data. To alleviate this issue, this article proposes a hierarchical multimodal feature aggregation-based multihead axial attention transformer (HMAT) for joint classification of hyperspectral and light detection and ranging (LiDAR) data. First, a hierarchical multimodal feature aggregation module (HMFA) is proposed to more effectively fuse spatial–spectral features of hyperspectral images (HSIs) and elevation features of LiDAR data and generate more discriminative low-dimensional feature representations. Second, a pyramid-inverted pyramid convolution module (PIP) is designed. Through the complementary feature extraction structure, PIP can more fully capture the multiscale local features in the fused feature map of hyperspectral and LiDAR data. Finally, a multihead axial attention (MHAA) component is constructed to capture information at different scales in the fused feature maps, thereby accurately modeling global dependencies. The proposed HMAT has been extensively tested on three publicly available datasets. The experimental results demonstrate that the classification performance of the proposed method outperforms that of several state-of-the-art methods.
This research proposes an explainable AI-driven framework for optimizing 2D character merchandise marketing content, addressing the critical gap between conventional heuristic-driven strategies and data-driven decision-making. The proposed system integrates causal feature attribution and attention-guided generation to systematically model the relationship between content attributes and user engagement dynamics. At its core, a feature attribution engine quantifies the impact of visual and textual elements using Shapley values, while a vision-language transformer prioritizes high-attention regions during content creation. Furthermore, a Bayesian optimization loop iteratively refines marketing strategies based on real-time feedback, dynamically adjusting design parameters and posting schedules. The framework uniquely bridges interpretable AI with creative workflows, enabling marketers to make quantifiable adjustments rather than relying on intuition. Our implementation leverages state-of-the-art multimodal transformers and accelerated Shapley value approximations, ensuring scalability without sacrificing interpretability. Experimental results demonstrate that the system outperforms traditional methods in engagement metrics, particularly in click-through rates and user retention. The novelty lies in its closed-loop feedback mechanism, where explainable insights directly parametrize content generation tools, fostering a symbiotic relationship between machine intelligence and human creativity. This work contributes to both the AI and marketing communities by providing a transparent, adaptive solution for content optimization in the rapidly growing 2D character merchandise industry.
Functional magnetic resonance imaging (fMRI) is a powerful tool for probing brain function, yet reliable clinical diagnosis is hampered by low signal-to-noise ratios, inter-subject variability, and the limited frequency awareness of prevailing CNN- and Transformer-based models. Moreover, most fMRI datasets lack textual annotations that could contextualize regional activation and connectivity patterns. We introduce RTGMFF, a framework that unifies automatic ROI-level text generation with multimodal feature fusion for brain-disorder diagnosis. RTGMFF consists of three components: (i) ROI-driven fMRI text generation, which deterministically condenses each subject's activation, connectivity, age, and sex into reproducible text tokens; (ii) a hybrid frequency-spatial encoder, which fuses a hierarchical wavelet-Mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure alongside long-range spatial dependencies; and (iii) an adaptive semantic alignment module, which embeds the ROI token sequence and visual features in a shared space, using a regularized cosine-similarity loss to narrow the modality gap. Extensive experiments on the ADHD-200 and ABIDE benchmarks show that RTGMFF surpasses current methods in diagnostic accuracy, achieving notable gains in sensitivity, specificity, and area under the ROC curve. Code is available at https://github.com/BeistMedAI/RTGMFF.
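The adaptive semantic alignment module is described as using a regularized cosine-similarity loss between ROI text tokens and visual features in a shared space; a hedged sketch is below, with the regularizer chosen for illustration only.

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(text_emb, vis_emb, reg_weight=0.01):
    """Sketch of a regularized cosine-similarity alignment loss between paired
    ROI-text and visual embeddings in a shared space (shapes and the exact
    regularizer are assumptions).
    """
    t = F.normalize(text_emb, dim=-1)     # (B, D)
    v = F.normalize(vis_emb, dim=-1)      # (B, D)

    # Maximize cosine similarity of matched pairs (minimize 1 - cos)
    align = (1.0 - (t * v).sum(dim=-1)).mean()

    # Simple regularizer discouraging embedding-norm collapse before normalization
    reg = (text_emb.norm(dim=-1) - 1).pow(2).mean() + (vis_emb.norm(dim=-1) - 1).pow(2).mean()

    return align + reg_weight * reg
```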
The basic task of vehicle density estimation is to use image information to estimate the distribution and quantity of vehicles within it. However, many previous methods only use the optical information in red-green-blue (RGB) images, which makes it difficult to effectively identify potential vehicles under poor light, strong reflections, and bad weather, resulting in unsatisfactory density estimation performance. To address these problems, we consider introducing thermal images to provide a richer source of information for the vehicle density estimation task, and propose a multimodal feature fusion network (MFCNet) for accurate RGB-Thermal (RGB-T) vehicle density estimation. First, multimodal features are cross-integrated through the attention-guided multiscale feature fusion coordination module (MFFC) to compensate for the limitations of single modal features. Following this, the edge feature calibration module (EFC) is utilized to correct the spatial misalignment regions between modalities. Subsequently, the adaptive deep fusion module (ADFM) is applied to further refine the features on the global scale and improve the intermodality correlation. Finally, the features of different stages are fused step by step to obtain the final fused feature, which is fed into a simple regression header to generate a pixel-level vehicle density map. Experimental results show that the GAME2 and root mean square error of the proposed method are reduced to 5.21 and 3.54 on the DroneVehicle dataset, respectively. Compared with existing vehicle density estimation methods, MFCNet achieves competitive accuracy and can be applied to the vehicle density estimation task in unconstrained scenarios. Our codes will be available at https://github.com/QLingX/MFCNet.
In the field of synthetic aperture radar automatic target recognition (SAR ATR), inherent distributional discrepancies between electromagnetic synthetic and measured SAR images pose significant challenges to the potential applications of the former. To bridge the gap, a novel unsupervised domain adaptation framework based on multimodal feature fusion and global-local joint alignment (MFJA) is proposed in this article. The multimodal feature fusion focuses on describing each target more comprehensively by leveraging both visual and scattering topological information. In the visual branch, the full-aperture image is decomposed into multiple subaperture images to explore the scattering variations in the target at different azimuths, facilitating a richer visual description. Meanwhile, both local scattering and spatial position information of keypoints are simultaneously integrated into the feature extraction in the scattering topological branch, promoting a more comprehensive scattering topological representation. Subsequently, a gated feature fusion module (GFFM) is developed to effectively fuse features derived from different modalities. The global-local joint alignment aims to align different domains with greater precision. Specifically, a power normalized weighted gradient reversal layer (PN-WGRL) is proposed to guide the network to focus more on hard-to-align samples during global domain alignment, thus mitigating their interference with local domain alignment. While MFJA achieves satisfactory cross-domain recognition performance, its inference efficiency is somewhat constrained. Therefore, a domain-invariant cross-modal knowledge distillation (DCKD) algorithm with a tri-path collaborative alignment strategy is further developed to distill discriminative and domain-invariant knowledge from the multimodal model into a compact visual model based on full-aperture images, thereby accelerating inference. Experiments conducted in three scenarios on the public synthetic and measured paired labeled experiment (SAMPLE) dataset validate the effectiveness of both MFJA and DCKD.
The joint classification of hyperspectral image (HSI) and light detection and ranging (LiDAR) data seeks to provide a more comprehensive characterization of target objects. Multimodal data possess distinct semantic structures in both spectral and spatial dimensions, making efficient feature complementarity and redundancy elimination crucial. To this end, we propose a self-distillation-based multimodal feature alignment network (DFANet), which employs two branches to capture spectral and spatial similarities, respectively, and integrates structural discriminative information from LiDAR at two stages for more effective multimodal data integration. The network comprises three main components: a feature alignment fusion module (FAFM), an offset attention module (OAM), and a self-distillation mechanism. Specifically, the FAFM guides feature alignment through channel-assimilative mapping of multimodal data. The OAM addresses boundary patch classification challenges by learning offset weights of reference points. The self-distillation mechanism filters out irrelevant information during feature alignment by enhancing the coordination between high-level and low-level features. Adequate experiments indicate that our method achieves better results compared to the most recent hyperspectral classification methods on three public datasets.
Forward-looking sonar (FLS) image segmentation can help reduce the amount of raw data that needs to be transmitted in underwater communication systems, making it a crucial technique for next-generation communication systems and the Internet of Things (IoT). However, its effectiveness is often hindered by weak semantic information, blurry edges and low resolution, which pose challenges for current segmentation algorithms. In this study, we propose a multimodal feature-enhanced Unet for FLS image segmentation (MFEUnet), built upon the Unet framework. The multimodal features considered primarily include spatial and frequency features. For spatial features, recognizing Unet’s strength in local feature extraction, we integrate a transformer to enhance its ability to capture global features. Frequency features are utilized to capture different details of FLS images, with a dual-branch wavelet transformation employed to decompose images into low-frequency and high-frequency components, facilitating the enhancement of these features. And a preprocessing reconstruction module is integrated to reduce the noise of FLS images. Furthermore, to address class imbalance in FLS datasets, we design a specialized segmentation loss function. Experimental results show that MFEUnet significantly outperforms state-of-the-art segmentation methods, demonstrating its effectiveness in overcoming the unique challenges of underwater sonar imaging.
The traditional online sports education action-normality matching algorithm relies only on image data and is susceptible to environmental interference, resulting in misjudgments and omissions. To this end, an algorithm based on multimodal feature fusion is proposed. The algorithm first performs multimodal feature extraction on sports movements, integrating multiple data sources such as image data and sensor data. The image data include spatial features such as joint position, body contour, posture, limb angle, and motion trajectory, obtained through the color and depth infrared cameras of Kinect devices; the sensor data include temporal features such as acceleration, angular velocity, force, and motion rhythm, collected by the built-in sensors of the Kinect device. A weighted fusion method is used to fuse the features and assign weights, and the normality of the actions is then evaluated through keyframe matching. In the experimental stage, two mainstream indoor action recognition datasets, NTU RGB+D 60 and NTU RGB+D 120, were selected, from which three sets of actions with similar joint coordinates were chosen to construct a similar-action dataset for verification. The experiments show that the algorithm can more accurately identify and match joint standard actions without misjudgment or omission. The Kappa coefficient of all tested actions exceeds 0.8, and the method has significant advantages in matching accuracy, applicability, and generalization ability.
Different modalities of neuroimaging data can provide complementary lesion information for the diagnosis of Alzheimer's disease. However, existing methods face trade-offs in obtaining three-dimensional lesion features and suppressing redundant 3D features. It is difficult to simultaneously preserve the ability of 2D networks to reduce redundant features and lower training costs while extracting features of the 3D lesion regions with spatial structure. To address this challenge, this paper proposes a multi-dimensional integrated multimodal feature fusion AD prediction network (MDMF-Net) to obtain low-redundancy lesion information with 3D spatial structural features. First, a multi-dimensional joint feature extraction module is proposed, where a 2D network is used to acquire 2D lesion features and generate a saliency feature map. Based on the saliency feature map, key lesion blocks are segmented from the 3D brain images to extract 3D lesion features. Secondly, a multimodal multi-view perception feature fusion module is designed to reduce heterogeneity of features from different modalities and dimensions through various attention mechanisms and a dual-stage fusion strategy. Finally, a saliency feature loss is introduced to enhance the saliency response of the key lesion regions while suppressing the response from irrelevant regions. Experimental results on the ADNI database show that our model achieves a classification accuracy of 89.12%, outperforming several competitive methods in terms of prediction performance.
Existing fall detection methods face three critical challenges in complex dynamic environments: high false alarm rates, insufficient modeling of long-duration action sequences, and privacy risks from data leakage. Traditional unimodal models struggle to capture the multi-stage causal evolution of fall motions comprehensively. To address these limitations, we propose a Spatio-Temporal Collaborative Attention Network (STCANet) that integrates bidirectional spatio-temporal attention mechanisms with optimized multimodal feature fusion, significantly enhancing detection accuracy and efficiency. The architecture employs a dual-path Transformer framework, jointly modeling spatial joint correlations and temporal causal chains through space-to-time and time-to-space pathways. Additionally, a fusion framework combining kinematic features (centroid velocity/joint angular velocity) with geometric features (silhouette deformation/aspect ratio) is designed to strengthen the model's discriminative power for fall recognition. Furthermore, a lightweight skeleton-based data anonymization model is developed to ensure privacy security while achieving synergistic optimization of both privacy and computational efficiency. Experimental results on the Human3.6M dataset demonstrate a detection precision of 94.2% and recall rate of 93.8%, with false alarms reduced by 74.7% compared to state-of-the-art methods. The model requires only 1.2M parameters and achieves real-time inference at 167 FPS.
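The kinematic and geometric descriptors mentioned above (centroid velocity, silhouette deformation, aspect ratio) can be computed directly from 2D skeleton sequences; the sketch below uses joint bounding boxes as a silhouette proxy, which is an assumption rather than the paper's feature definition.

```python
import numpy as np

def fall_features(joints, fps=30.0):
    """Hedged sketch of kinematic/geometric fall descriptors.

    joints: (T, J, 2) pixel coordinates of J skeleton joints over T frames.
    Returns per-frame centroid speed, bounding-box aspect ratio, and its change;
    definitions and the bounding-box silhouette proxy are illustrative assumptions.
    """
    centroid = joints.mean(axis=1)                                           # (T, 2)
    centroid_speed = np.linalg.norm(np.diff(centroid, axis=0), axis=1) * fps # (T-1,)

    # Silhouette proxy: bounding box of the joints per frame
    mins, maxs = joints.min(axis=1), joints.max(axis=1)                      # (T, 2) each
    width = maxs[:, 0] - mins[:, 0]
    height = maxs[:, 1] - mins[:, 1]
    aspect_ratio = width / np.maximum(height, 1e-6)   # grows as the body tilts toward horizontal

    return {
        "centroid_speed": centroid_speed,
        "aspect_ratio": aspect_ratio,
        "aspect_ratio_change": np.diff(aspect_ratio) * fps,
    }
```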
The aggregation of multimodal features in medical image registration remains underexplored, limiting the performance of current models in capturing complex anatomical relationships. Traditional convolutional neural networks (CNNs) often overlook the rich semantic information available from text, while existing approaches lack effective methods to combine spatial and contextual cues. In this paper, we propose Text Aggregation for Medical Image Registration (TA-MIR), a novel framework that enhances the encoder-decoder architecture by incorporating anatomical text embeddings throughout the registration process. By employing large kernel blocks for improved receptive fields in U-Net and fusion blocks at each level, our model effectively integrates image features with semantic text information. Extensive experiments on three brain MRI datasets (OASIS, IXI, and LPBA40) demonstrate that our approach achieves state-of-the-art performance, significantly improving registration accuracy and anatomical coherence compared to traditional CNN- and Transformer-based methods.
Children with Autism Spectrum Disorder (ASD) face significant difficulties in emotional expression and recognition, and traditional manual observation methods struggle to capture their weak and transient micro-expression features. To address this issue, this paper proposes an autism emotion recognition model that integrates Vision Transformer with multimodal features. The model first employs the TVL1 optical flow algorithm to extract facial motion features and utilizes Vision Transformer to model long-range dependencies between different facial regions. Subsequently, it introduces a feature selection fusion module (FSFM) to filter key image patches, a cross-attention fusion module (CAFM) to integrate horizontal and vertical optical flow information, and designs a spatial consistency attention module (SCAM) to ensure feature distribution consistency. Finally, it incorporates Maximally Collapsing Metric Learning (MCML) to optimize the feature space structure. On standard micro-expression databases including MMEW, CASMEII, and SAMM, this method achieves recognition accuracies significantly superior to existing approaches (73.0%, 76.4%, and 70.5%, respectively). Furthermore, the proposed method demonstrates good generalization capability and real-time performance, showing promise as an intelligent assistive tool for special education to enhance teachers’ understanding of emotional states in children with autism and improve intervention efficiency, thereby promoting personalized education development.
Diabetic Retinopathy (DR) remains a leading preventable cause of visual disability globally, necessitating strong automated screening methods. In this paper, we present a novel dual-network architecture that explicitly models spatial relations in retinal pathology, a primary limitation of traditional deep learning approaches to DR grading. Our objective is to enhance classification performance via strategic fusion of convolutional feature learning and graph-based spatial reasoning. Our approach transforms CNN-extracted features into structured graph representations that enable high-level modeling of interregional relations using attention mechanisms. Comprehensive testing on the APTOS2019 dataset attains significant improvements in multi-grade DR classification accuracy and Quadratic Kappa Score over state-of-the-art baseline approaches. Our results quantitatively confirm that explicit spatial feature relationship modeling significantly enhances automated DR severity grading. This paper advances the boundaries of automated retinal disease grading by presenting an effective architectural model that integrates convolutional and graph neural networks to learn local pathological features and their global contextual correlations.
Aiming at the problems of spatiotemporal alignment, insufficient modal interaction, and weak modeling of long-term dependencies in abnormal behavior detection based on video and location, a multi-modal feature fusion method for abnormal behavior detection based on attention mechanisms is proposed. The proposed method consists of three modules: feature extraction, multi-modal feature fusion, and anomaly detection. First, in the feature extraction module, we utilize the ViViT (Video Vision Transformer) model to extract video features, capturing spatiotemporal action continuity through 3D block embedding; we utilize ST-GCN (Spatial Temporal Graph Convolutional Networks) to extract localization features, constructing the spatiotemporal graph with the locations of target personnel as nodes and spatiotemporal movement correlation as edges. Then, in the multi-modal feature fusion module, we utilize the cross-attention mechanism to fuse video and localization features to improve the expressive ability of cross-modal data features, and we reduce the number of network parameters through a parameter-sharing strategy. Additionally, we apply a spatiotemporal separation attention mechanism to jointly model the spatiotemporal correlation between video action features and the trajectories of target personnel. At the same time, a dynamic gating fusion strategy is introduced to dynamically adjust weights based on the quality of video and localization data during training. Finally, in the anomaly detection module, a multi-granularity decoder is used to generate frame-level video anomaly probabilities and trajectory segment anomaly scores in parallel, with a spatiotemporal consistency loss function constraining their alignment to ensure the spatiotemporal matching of behavior and trajectory anomalies. This framework offers an efficient and robust multi-modal analysis approach for security monitoring systems.
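A minimal sketch of bidirectional cross-attention fusion between video tokens and localization (trajectory) tokens, with a single shared attention module standing in for the parameter-sharing strategy mentioned above; dimensions and the residual/normalization layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal sketch of cross-attention fusion between video and localization
    token sequences; dimensions and the shared-projection choice are assumptions
    for illustration, not the paper's exact architecture.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # One attention module reused in both directions to limit parameter count
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, video_tokens, traj_tokens):
        # video_tokens: (B, Nv, dim) from the video backbone (e.g., ViViT)
        # traj_tokens:  (B, Nt, dim) from the localization branch (e.g., ST-GCN)
        v2t, _ = self.attn(query=video_tokens, key=traj_tokens, value=traj_tokens)
        t2v, _ = self.attn(query=traj_tokens, key=video_tokens, value=video_tokens)
        video_fused = self.norm_v(video_tokens + v2t)   # residual + norm per modality
        traj_fused = self.norm_t(traj_tokens + t2v)
        return video_fused, traj_fused
```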
The clinical diagnosis of depression depends on subjective scales, which lack objectivity. Thus, precise auxiliary diagnostic tools are urgently needed. Although electroencephalography (EEG) can offer an objective foundation for auxiliary depression diagnosis, its complex spatiotemporal and spectral features present substantial challenges for feature extraction. To overcome the limitations of existing methods in multidimensional information fusion and key feature selection, we propose a dual-stream spatiotemporal and frequency-spatial attention U-Net model, termed DSMFF-UNet. The model processes spatiotemporal and frequency-spatial 3D EEG representations in parallel. It adaptively selects and weights the most discriminative deep features via attention modules integrated into the U-Net skip connections. On the public MODMA dataset, DSMFF-UNet achieved 96.33% accuracy and an F1 score >0.96, substantially outperforming existing baselines. These findings indicate that the deep integration and adaptive emphasis on multidimensional EEG features offer an effective approach for high-accuracy automated depression detection. This lays the groundwork for objective clinical diagnostic aids.
Brain–computer interfaces (BCIs) are an important mode of human-computer interaction with the ability to monitor brain states, and they have become an increasingly significant research direction. Single-modal noninvasive brain signals have limitations, such as low spatial resolution or low temporal resolution, while multimodal brain signal acquisition and processing can overcome them. Combined electroencephalography and functional near-infrared spectroscopy (EEG-fNIRS) is advantageous for multimodal brain signal processing, but current fusion methods mostly rely on manual feature extraction or channel selection, which may discard important information in real-time BCI systems. To solve this issue, this article proposes an innovative fusion analysis method for EEG-fNIRS multimodal brain signals, using a hybrid algorithm that combines a convolutional neural network (CNN) and attention mechanisms for signal classification. The method first preprocesses the EEG and fNIRS signals separately, then extracts features using spatial-temporal convolutional layers, and finally merges them through dual attention computation to obtain the classification results. Our method is validated on two publicly available mixed EEG-fNIRS BCI datasets, covering three experimental tasks that do not involve actual movement: motor imagery (MI), mental arithmetic (MA), and word generation (WG). The accuracy reached 92.2% for MI, 98.6% for MA, and 95.2% for WG, surpassing all current methods. This indicates that the proposed method achieves better classification performance on non-movement classification tasks while remaining lightweight, and it can be applied to rapid and efficient identification of brain signals.
Recent advancements in emotion recognition research based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between various modalities, such as video and electroencephalography (EEG) data, in emotion recognition. In this article, a feature fusion-based hierarchical cross-modal spatial fusion network (HCSFNet) is proposed that effectively integrates EEG and video features. By designing an EEG feature extraction network based on 1-D convolution and a video feature extraction network based on 3-D convolution, corresponding modality features are thoroughly extracted. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed in this article. Additionally, to enhance the network's perceptual ability for emotion-related features, a multiscale spatial pyramid pooling module is also designed. Meanwhile, a self-distillation method is introduced, which enhances the performance while reducing the number of parameters in the network. The HCSFNet achieved an accuracy of 97.78% on the valence–arousal dimension of the Database for Emotion Analysis using Physiological Signals (DEAP) dataset, and it also obtained an accuracy of 60.59% on the MAHNOB-human-computer interaction (HCI) dataset, reaching the state-of-the-art level.
Through Internet of Things (IoT) communication technology, collaborative perception enhances a vehicle’s capacity to discern its surroundings while driving by integrating and synchronizing sensor data from multiple agents. With the advancement of cooperative perception techniques in single-modality methods, there has been a growing trend toward integrating multimodal data from heterogeneous sensors in recent years. However, due to the data heterogeneity inherent in diverse sensors, Bird’s Eye View (BEV) maps generated from different types of sensors may exhibit local discrepancies in the spatial representation of entity positions. Furthermore, individual agents may produce uncertain and flawed feature representations in real noisy environments. The influence of this indeterminacy exacerbates the issue of local inconsistency, leading to misalignment of the detected target during BEV alignment and fusion, thereby reducing detection accuracy. To address these problems, we propose a modal decision-making spatial alignment cooperative perception network (MDNet). First, the network generates BEV feature maps through dense depth image supervision for voxel feature extraction and model-guided selective feature fusion. Subsequently, we achieve enhanced accuracy in object detection by performing spatial alignment of BEV representations generated from two distinct sensors, both globally and locally within the spatial domain. Besides, we employ a cascaded centralized pyramid strategy during the message fusion stage, facilitating flexible sampling across horizontal and vertical spatial dimensions, promoting deep interaction among multiple agents. We conduct quantitative and qualitative experiments on the public OPV2V and DAIR-V2X-C benchmarks, and our proposed MDNet exhibits superior performance and stronger robustness in the 3-D object detection task, providing more precise target detection results.
Although remote sensing (RS) data with multiple modalities can significantly improve the accuracy of semantic segmentation (SS), how to effectively extract multimodal information through multimodal feature fusion remains a challenging task. Specifically, existing methods for multimodal feature fusion still face two major challenges: 1) due to the diverse imaging mechanisms of multimodal RS data, the boundaries of the same foreground may vary across modalities, leading to the inclusion of unwanted background semantics in the fused foreground features, and 2) RS data from different modalities exhibit varying discriminative abilities for different foregrounds, making it challenging to determine the proportion of semantic information contributed by each modality in the fusion results. To address these issues, we propose a dynamic feature fusion method based on region-wise queries, namely DF2RQ, for SS of multimodal RS data. The method is primarily composed of two components: the spatial reconstruction (SR) module and the dynamic fusion (DF) module. Within the SR module, we propose an SR scheme that samples foreground features from different modalities, achieving independent reconstruction of the unimodal features and thereby alleviating the semantic mixing between foreground and background across modalities. In the DF module, a feature fusion scheme based on unimodal feature reference positions is proposed to obtain fusion weights for each modality, enabling dynamic fusion of complementary features from multiple modalities. The performance of the proposed method has been extensively evaluated on various multimodal RS datasets for SS, and the experimental results consistently show that it achieves state-of-the-art (SOTA) accuracy on multiple commonly used metrics. Our code is available at https://github.com/I3ab/DF2RQ.
Accurate detection and precise localization of anomalies during precision component manufacturing are essential to maintaining high product quality. Multimodal industrial anomaly detection (MIAD) harnesses data from diverse sensors to effectively identify and pinpoint defects in industrial products. Recent MIAD approaches have made significant progress but often ignore the global contextual semantics of point cloud data and modality-specific information, resulting in incomplete point cloud representations and inadequate multimodal fusion. To confront these issues, we propose a robust feature representation and comprehensive multimodal feature fusion network, the views-graph and latent feature disentangled fusion network (VLDFNet), for anomaly detection in high-precision industrial components. VLDFNet mainly consists of a point cloud views-graph representation model and a multimodal disentangled feature latent space fusion module. Specifically, the point cloud views-graph representation model explores spatial locations and semantic relationships between views using multilevel graph fusion. The multimodal disentangled feature latent space fusion module disentangles multimodal features into shared and specific representations to mitigate the omission of modality-specific information. VLDFNet also introduces a cross-modal shared feature interaction (CSFI) strategy to extract coherent semantic information by aligning and integrating cross-modal features. Comprehensive experimental results on multiple datasets demonstrate that our method significantly outperforms existing approaches in detection accuracy.
Urban analytics increasingly relies on AI-driven trajectory analysis, yet current approaches suffer from methodological fragmentation: trajectory learning captures movement patterns but ignores spatial context, while spatial embedding methods encode street networks but miss temporal dynamics. Three gaps persist: (1) lack of joint training that integrates spatial and temporal representations, (2) origin-agnostic treatment that ignores directional asymmetries in navigation ($A \to B \ne B \to A$), and (3) over-reliance on auxiliary data (POIs, imagery) rather than fundamental geometric properties of urban space. We introduce a conditional trajectory encoder that jointly learns spatial and movement representations while preserving origin-dependent asymmetries using geometric features. This framework decomposes urban navigation into shared cognitive patterns and origin-specific spatial narratives, enabling quantitative measurement of cognitive asymmetries across starting locations. Our bidirectional LSTM processes visibility ratio and curvature features conditioned on learnable origin embeddings, decomposing representations into shared urban patterns and origin-specific signatures through contrastive learning. Results from six synthetic cities and real-world validation on Beijing's Xicheng District demonstrate that urban morphology creates systematic cognitive inequalities. This provides urban planners with quantitative tools for assessing experiential equity, offers architects insights into the cognitive impacts of layout decisions, and enables origin-aware analytics for navigation systems.
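A minimal sketch of an origin-conditioned bidirectional LSTM encoder over per-step geometric features, split into shared and origin-specific parts; the dimensions, pooling, and the concatenation of the origin embedding are illustrative assumptions rather than the authors' exact design.

```python
# Origin-conditioned trajectory encoder: a BiLSTM over per-step geometric
# features (e.g., visibility ratio, curvature) conditioned on a learnable
# origin embedding, producing a shared and an origin-specific code.
import torch
import torch.nn as nn

class OriginConditionedEncoder(nn.Module):
    def __init__(self, n_origins, feat_dim=2, origin_dim=16, hidden=64):
        super().__init__()
        self.origin_emb = nn.Embedding(n_origins, origin_dim)
        self.lstm = nn.LSTM(feat_dim + origin_dim, hidden,
                            batch_first=True, bidirectional=True)
        # split the pooled code into a shared part and an origin-specific part
        self.to_shared = nn.Linear(2 * hidden, 32)
        self.to_specific = nn.Linear(2 * hidden, 32)

    def forward(self, feats, origin_ids):
        # feats: (batch, steps, feat_dim); origin_ids: (batch,)
        origin = self.origin_emb(origin_ids)                    # (batch, origin_dim)
        origin = origin.unsqueeze(1).expand(-1, feats.size(1), -1)
        out, _ = self.lstm(torch.cat([feats, origin], dim=-1))
        pooled = out.mean(dim=1)                                # (batch, 2*hidden)
        return self.to_shared(pooled), self.to_specific(pooled)

enc = OriginConditionedEncoder(n_origins=6)
shared, specific = enc(torch.randn(4, 50, 2), torch.tensor([0, 1, 2, 3]))
```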
With the growing interest in 3D Gaussian Splatting (3DGS) for scene analysis, current approaches typically extract 2D features from multi-view images via pre-trained models before embedding them into 3DGS representations. Such multi-stage pipelines not only incur computational overhead but also fail to holistically leverage the geometric and appearance properties of 3DGS. We introduce GARNET, the first end-to-end feature extraction framework for pre-optimized 3DGS. It directly renders Gaussian primitives into feature maps, enabling joint learning of 3D structure and 2D texture. The framework comprises a Gaussian Propagator for information exchange among neighboring primitives and a Feature Renderer that generates viewpoint-specific feature maps to direct model attention toward discriminative regions. Using 3D object classification as a case study, GARNET achieves 93.83% accuracy on the texture-rich MACGS benchmark, surpassing the previous state-of-the-art by +1.92%, while also excelling on the geometry-focused ModelNet40GS. These gains come with minimal increases in parameters and inference time, making GARNET an efficient unified solution for 3DGS-based recognition tasks. Our code is available at https://github.com/zlfffan/Code/tree/Garnet3DGS.
Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.
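As a hedged illustration of how direction and norm can jointly determine correctness in such a geometric item-response model (this is not the paper's exact parameterization, only a simple form consistent with "direction encodes semantics, norm encodes difficulty"), with model embedding $m$ and question embedding $q$ one could write

$P(\text{correct} \mid m, q) = \sigma\big(\langle m,\, q/\lVert q\rVert \rangle - \lVert q\rVert\big),$

where stronger alignment of the model with the question's semantic direction raises the success probability and a larger question norm (greater difficulty) lowers it.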
Attributed multiplex networks are powerful representations of complex systems where nodes represent entities, their attributes represent the properties, and each type of interaction is modeled as a relationship (layer) in a network. To analyze these networks, it is crucial to find a meaningful representation of nodes, node attributes, and class labels into a joint low-dimensional space. To this end, we propose a Contrastive Joint Embedding approach for Multiple Networks, CJEMN, that employs negative sampling and pseudo-labeling to obtain a meaningful embedding of all information within an attributed multiplex network. To the best of our knowledge, this is the first approach that utilizes negative sampling and pseudo-labeling to jointly embed nodes, node attributes, and class labels of attributed multiplex networks in a low-dimensional space. In addition to using spectral embedding and homogeneity analysis, our method incorporates negative pairs as a new layer to enhance the representation of similarities and dissimilarities among nodes, attributes, and class labels. We run experiments on five real-world datasets to evaluate the performance of CJEMN. Our approach outperforms state-of-the-art methods for downstream tasks, such as node classification and clustering.
Knowledge Graphs (KGs), with their intricate hierarchies and semantic relationships, present unique challenges for graph representation learning, necessitating tailored approaches to effectively capture and encode their complex structures into useful numerical representations. The fractal-like nature of these graphs, where patterns repeat at various scales and complexities, requires specialized algorithms that can adapt to and learn from the multi-level structures inherent in the data. This similarity to fractals requires methods that preserve the recursive detail of knowledge graphs while facilitating efficient learning and extraction of relational patterns. In this study, we explore the integration of similarity groups with attention mechanisms to represent knowledge graphs in complex spaces. In our approach, SimE, we make use of the algebraic (bijection) and geometric (similarity) properties of similarity transformations to enhance the representation of self-similar fractals in KGs. We empirically validate SimE's ability to represent bijections and similarities on benchmark KGs. We also conducted controlled experiments that capture one-to-one, one-to-many, and many-to-many relational patterns and studied the behavior of state-of-the-art models, including the proposed SimE model. Because of the lack of benchmark fractal-like KG datasets, we created a set of fractal-like testbeds to assess the subgraph similarity learning ability of models. The observed results suggest that SimE captures the complex geometric structures of KGs whose statements satisfy these algebraic and geometric properties. In particular, SimE is competitive with state-of-the-art KG embedding models and achieves high Hits@1 values; as a result, it effectively predicts correct links and ranks them highly. SimE is publicly available on GitHub: https://github.com/NIMI-research/SimE.
While raw cosine similarity in pretrained embedding spaces exhibits strong rank correlation with human judgments, anisotropy induces systematic miscalibration of absolute values: scores concentrate in a narrow high-similarity band regardless of actual semantic relatedness, limiting interpretability as a quantitative measure. Prior work addresses this by modifying the embedding space (whitening, contrastive fine-tuning), but such transformations alter geometric structure and require recomputing all embeddings. Using isotonic regression trained on human similarity judgments, we construct a monotonic transformation that achieves near-perfect calibration while preserving rank correlation and local stability (98% across seven perturbation types). Our contribution is not to replace cosine similarity, but to restore interpretability of its absolute values through monotone calibration, without altering its ranking properties. We characterize isotonic calibration as an order-preserving reparameterization and prove that all order-based constructions (angular ordering, nearest neighbors, threshold graphs, and quantile-based decisions) are invariant under this transformation.
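Because the calibration step is a standard isotonic fit, a minimal sketch with scikit-learn's IsotonicRegression illustrates the idea; the data below is synthetic, whereas in the paper's setting the targets are human similarity judgments.

```python
# Calibrating raw cosine similarities with a monotone (order-preserving) map
# learned by isotonic regression: rankings are unchanged, but absolute values
# become interpretable. Synthetic data stands in for human judgments.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw_cosine = rng.uniform(0.55, 0.95, size=200)   # anisotropy: narrow high band
human_scores = np.clip((raw_cosine - 0.55) / 0.40 + rng.normal(0, 0.05, 200), 0, 1)

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_cosine, human_scores)

# calibrated values now span an interpretable 0-1 range, same ordering as input
print(calibrator.predict(np.array([0.60, 0.75, 0.90])))
```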
Hyperspectral unmixing techniques face challenges of insufficient endmember separability and degradation of abundance spatial continuity in complex scenarios. Traditional methods, which often overlook the geometric structure information of the data and feature space, exhibit significant performance limitations in the presence of noise and microscopic mixing. To address these issues, this paper proposes a dual-graph manifold-regularized hyperspectral unmixing framework, which, for the first time, jointly embeds a data graph (modeling the similarity of pixel spatial distribution) and a feature graph (constraining the endmember spectral manifold structure) into the nonnegative matrix factorization (NMF) model. By jointly preserving the spatial-spectral geometric properties, the proposed approach achieves precise decoupling of endmembers and abundances. This method innovatively designs dual-graph Laplacian regularization terms, which simultaneously enhance the spatial smoothness of abundances and the spectral discriminability of endmembers within a unified optimization objective. An adaptive alternating optimization algorithm is developed to solve the resulting nonconvex problem. Experiments on both synthetic and real hyperspectral data demonstrate that the proposed method significantly outperforms state-of-the-art algorithms, providing a robust and physically interpretable unmixing paradigm for complex mixing scenarios.
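For reference, the generic dual-graph-regularized NMF objective that this kind of framework builds on (the paper's exact terms and weights may differ) factorizes the hyperspectral data $X$ into endmembers $W$ and abundances $H$ while penalizing roughness on a data graph with Laplacian $L_d$ and a feature graph with Laplacian $L_f$:

$\min_{W \ge 0,\, H \ge 0}\ \lVert X - WH \rVert_F^2 + \lambda_1\,\mathrm{tr}\!\left(H L_d H^{\top}\right) + \lambda_2\,\mathrm{tr}\!\left(W^{\top} L_f W\right),$

where the first regularizer encourages spatially smooth abundances over similar pixels and the second constrains the endmember spectra to the feature manifold.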
Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while preserving maximum-entropy up to rescaling under expected $\ell_p$ norm constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and competitive downstream performance on image classification benchmarks, demonstrating that RDMReg effectively enforces sparsity while preserving task-relevant information.
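As a rough illustration of a sliced two-sample distribution-matching penalty, the sketch below projects representations and samples from a target distribution onto random directions and compares the sorted projections; a plain rectified Gaussian stands in for the Rectified Generalized Gaussian target, so this is only a generic sliced-matching loss, not RDMReg itself.

```python
# Sliced two-sample distribution matching: compare sorted 1-D projections of
# the representation batch and of samples drawn from a target distribution
# (a sliced Wasserstein-style penalty). The rectified-Gaussian target is a
# stand-in, not the paper's RGG distribution.
import torch

def sliced_matching_loss(z, target_sampler, n_slices=64):
    batch, dim = z.shape
    target = target_sampler((batch, dim))
    dirs = torch.randn(dim, n_slices)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)      # unit projection directions
    proj_z = (z @ dirs).sort(dim=0).values            # (batch, n_slices)
    proj_t = (target @ dirs).sort(dim=0).values
    return ((proj_z - proj_t) ** 2).mean()

rectified_gaussian = lambda shape: torch.relu(torch.randn(shape))
z = torch.randn(128, 32, requires_grad=True)          # representations to regularize
loss = sliced_matching_loss(z, rectified_gaussian)
loss.backward()
```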
The significance of Temporal Knowledge Graphs (TKGs) in Artificial Intelligence (AI) lies in their capacity to incorporate time-dimensional information, support complex reasoning and prediction, optimize decision-making processes, enhance the accuracy of recommendation systems, promote multimodal data integration, and strengthen knowledge management and updates. This provides a robust foundation for various AI applications. To effectively learn and apply both static and dynamic temporal patterns for reasoning, a range of embedding methods and large language models (LLMs) have been proposed in the literature. However, these methods often rely on a single underlying embedding space, whose geometric properties severely limit their ability to model intricate temporal patterns, such as hierarchical and ring structures. To address this limitation, this paper proposes embedding TKGs into projective geometric space and leveraging LLM technology to extract crucial temporal node information, thereby constructing the 5EL model. By embedding TKGs into projective geometric space and utilizing Möbius group transformations, we effectively model various temporal patterns. Subsequently, LLM technology is employed to process the trained TKGs. We adopt a parameter-efficient fine-tuning strategy to align the LLMs with specific task requirements, thereby enhancing the model's ability to recognize structural information of key nodes in historical chains and enriching the representation of central entities. Experimental results on five advanced TKG datasets demonstrate that our proposed 5EL model significantly outperforms existing models.
Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or "concepts". We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings while retaining the most sparsity. Retraining SAEs with different seeds or a different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but commonly activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
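A minimal sketch of the kind of sparse autoencoder involved: an overcomplete dictionary trained on frozen embeddings with an L1 penalty, so each embedding is approximated by a sparse linear combination of learned "concept" directions. The sizes and penalty weight are illustrative assumptions, not the released SAEs' configuration.

```python
# Sparse autoencoder on frozen embeddings: ReLU codes over an overcomplete
# dictionary, reconstruction loss plus an L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, embed_dim=512, n_concepts=4096):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, n_concepts)
        self.decoder = nn.Linear(n_concepts, embed_dim, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse, non-negative activations
        recon = self.decoder(codes)           # sparse combination of concept directions
        return recon, codes

sae = SparseAutoencoder()
emb = torch.randn(256, 512)                   # stand-in for frozen VLM embeddings
recon, codes = sae(emb)
loss = ((recon - emb) ** 2).mean() + 1e-3 * codes.abs().mean()
loss.backward()
```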
Generating Scalable Vector Graphics (SVG) from natural language descriptions poses significant challenges due to the need for precise semantic understanding, structural consistency, and strict syntactic adherence. Existing models often struggle to balance these aspects effectively. This paper proposes SVGGemma-Tuner, a fine-tuning framework that integrates structured instruction embedding to enhance geometric semantic comprehension, a dual-stage decoding architecture to separate layout planning from SVG token generation, and a syntax-aware reinforcement module to optimize syntactic validity through reinforcement learning. By jointly optimizing sequence prediction, spatial alignment, and syntax compliance, SVGGemma-Tuner demonstrates superior performance over existing approaches in generating coherent, semantically accurate, and syntactically valid SVG outputs.
The study of neural representations, both in biological and artificial systems, is increasingly revealing the importance of geometric and topological structures. Inspired by this, we introduce Event2Vec, a novel framework for learning representations of discrete event sequences. Our model leverages a simple, additive recurrent structure to learn composable, interpretable embeddings. We provide a theoretical analysis demonstrating that, under specific training objectives, our model's learned representations in a Euclidean space converge to an ideal additive structure. This ensures that the representation of a sequence is the vector sum of its constituent events, a property we term the linear additive hypothesis. To address the limitations of Euclidean geometry for hierarchical data, we also introduce a variant of our model in hyperbolic space, which is naturally suited to embedding tree-like structures with low distortion. We present experiments to validate our hypothesis. Quantitative evaluation on the Brown Corpus yields a Silhouette score of 0.0564, outperforming a Word2Vec baseline (0.0215), demonstrating the model's ability to capture structural dependencies without supervision.
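The linear additive hypothesis is easy to illustrate: if a sequence code is the sum of learned event vectors, then the code of a concatenation equals the sum of the codes of its parts. The PyTorch sketch below shows only this Euclidean, additive idea; all sizes are illustrative and the hyperbolic variant is not covered.

```python
# Additive event-sequence encoder: the sequence representation is the vector
# sum of its event embeddings, so composition is literally addition.
import torch
import torch.nn as nn

class AdditiveEventEncoder(nn.Module):
    def __init__(self, n_event_types, dim=64):
        super().__init__()
        self.event_emb = nn.Embedding(n_event_types, dim)

    def forward(self, event_ids):
        # event_ids: (batch, seq_len); sequence code = sum of event vectors
        return self.event_emb(event_ids).sum(dim=1)

enc = AdditiveEventEncoder(n_event_types=100)
seq = torch.randint(0, 100, (8, 12))
codes = enc(seq)                                  # (8, 64)
# the code of a concatenated sequence equals the sum of its parts' codes
assert torch.allclose(enc(seq[:, :6]) + enc(seq[:, 6:]), codes, atol=1e-5)
```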
Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures. Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the generated answers, whereas NLI-based clustering remains advantageous for LLaVA-Med and for longer, sentence-level responses. Across configurations, the VASE metric consistently provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ~ 10-15). Prompt design also matters: concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses. By framing hallucination detection as a geometric robustness problem shaped jointly by sampling scale, prompt structure, model architecture, and clustering strategy, HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE .
Deep learning has achieved state-of-the-art video action recognition (VAR) performance by comprehending action-related features from raw video. However, these models often learn to jointly encode auxiliary view information (viewpoints and sensor properties) with primary action features, leading to performance degradation under novel views and to security concerns by revealing sensor types and locations. Here, we systematically study these shortcomings of VAR models and develop a novel approach, VIVAR, that learns view-invariant spatiotemporal action features by removing view information. In particular, we leverage contrastive learning to separate actions and jointly optimize an adversarial loss that aligns view distributions to remove auxiliary view information in the deep embedding space, using unlabeled synchronous multiview (MV) video to learn a view-invariant VAR system. We evaluate VIVAR using our in-house large-scale time-synchronous MV video dataset containing 10 actions with three angular viewpoints and sensors in diverse environments. VIVAR successfully captures view-invariant action features, improves the quality of inter- and intra-action clusters, and consistently outperforms SoTA models with 8% higher accuracy. We additionally perform extensive studies with our datasets, model architectures, multiple contrastive learning variants, and view distribution alignments to provide insights into VIVAR. We open-source our code and dataset to facilitate further research in view-invariant systems.
This paper presents a dual-modal instance segmentation framework based on Spatial Axial Band Attention (SABA) for electrical distribution scenarios. To address the challenges of edge ambiguity, thermal heterogeneity, and small-target omission in single-modal methods, we propose: 1) a dual-stream feature pyramid with non-isotropic band partitioning that decomposes global attention into orthogonal local band attention, and 2) a cross-scale guided aggregation mechanism to resolve intensity inhomogeneity caused by uneven heat dissipation. Experimental results on our electrical distribution dataset demonstrate leading performance with 62.0 mAP@0.5 and 56.3% mask IoU, surpassing single-modal baselines by absolute gains of 3.6-9.5%. The lightweight architecture also has lower complexity (79.68 GFLOPs) than several of the compared dual-modal methods.
Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) is a medical imaging technique that plays a crucial role in the detailed visualization and identification of tissue perfusion in abnormal lesions and in radiological suggestions for biopsy. However, DCE-MRI involves the administration of a Gadolinium-based (Gad) contrast agent, which carries a risk of toxicity in the body. Previous deep learning approaches that synthesize DCE-MR images employ unimodal non-contrast or low-dose contrast MRI images and lack focus on the local perfusion information within the anatomy of interest. We propose AAD-DCE, a generative adversarial network (GAN) with an aggregated attention discriminator module consisting of global and local discriminators. The discriminators provide a spatially embedded attention map to drive the generator to synthesize early- and late-response DCE-MRI images. Our method employs multimodal inputs - T2 weighted (T2W), Apparent Diffusion Coefficient (ADC), and T1 pre-contrast - for image synthesis. Extensive comparative and ablation studies on the ProstateX dataset show that our model (i) is agnostic to various generator benchmarks, (ii) outperforms other DCE-MRI synthesis approaches with improvement margins of +0.64 dB PSNR, +0.0518 SSIM, and -0.015 MAE for early response and +0.1 dB PSNR, +0.0424 SSIM, and -0.021 MAE for late response, and (iii) emphasizes the importance of attention ensembling. Our code is available at https://github.com/bhartidivya/AAD-DCE.
Short-term precipitation forecasting is an important part of modern weather prediction systems. Using richer observational data can improve the accuracy of rainfall predictions. Current methods often rely on single-modality data, such as radar echo data, which cannot fully capture the many factors affecting rainfall changes; this limits the accuracy and timeliness of predictions. To address this issue, we propose a novel Multimodal Data Attention Fusion Network (MAFN) that combines radar echo and wind speed data, leveraging their complementary strengths to improve prediction accuracy. Specifically, MAFN includes a dual-stream encoder to separately extract spatiotemporal and sequential features, effectively combining spatial information on rainfall with the dynamic processes of its movement. It also features an Attention Fusion Module (AFM) to align and merge the feature information, and a decoder to generate the rainfall map. Experiments on the ERA5 dataset show that MAFN outperforms existing models, achieving higher accuracy and robustness in precipitation forecasting.
Multimodal biometric authentication is a robust security mechanism designed to enhance the reliability and security of user-authentication systems. It integrates several biometric traits to provide an accurate and secure identification system. However, existing Deep Learning (DL) models struggle to capture both spatial and contextual dependencies across multiple biometric traits, which reduces their robustness against spoofing attacks. Hence, this paper proposes CSMG, a model combining a convolutional neural network, Swin Transformer, multi-head self-attention, and global max pooling, for effective feature extraction, strengthening both spatial and contextual representation and reducing redundancy across modalities. The classification head then uses the extracted feature map and a softmax layer to predict a person's identity. An effective fusion strategy is introduced to integrate fingerprint, iris, and ECG signals, utilizing their complementary strengths to mitigate spoofing attacks. The performance of the proposed CSMG method was evaluated on the IITD Iris, SOCOFing fingerprint, and HEARTPRINT ECG datasets. The experimental evaluation demonstrates that CSMG achieves a recognition accuracy of 99.90% for fingerprints, 99% for irises, and 99% for ECG, outperforming traditional models.
This manuscript proposes a Multimodal input Residual Encoder-Decoder Network (MultimodalInputRED-CNN) leveraging complementary information from CT images and iodine maps to enhance pediatric CT image denoising while preserving diagnostically critical details. The network has three key components: (1) a specialized dual-branch encoder that processes CT images with $5 \times 5$ convolution kernels for noise suppression and iodine maps with $3 \times 3$ kernels for detail preservation; (2) a cross-attention mechanism that computes channel and spatial attention weights to adaptively fuse features based on local image characteristics; and (3) a noise confidence assessment module that dynamically adjusts the influence of iodine map features according to noise conditions. The method was evaluated on pediatric lower limb CT images under various noise scenarios. Under Gaussian noise, the MultimodalInputRED-CNN achieved a PSNR of $33.79 \pm 1.31$ dB and an SSIM of $0.9560 \pm 0.0163$, outperforming NAFNet by 1.23 dB. For mixed noise, our LPIPS score of $0.0349 \pm 0.0037$ represented a 36.4% improvement over EDCNN. Ablation studies showed that removing the iodine map input resulted in a 1.76 dB PSNR drop. The MultimodalInputRED-CNN overcomes the limitations of traditional single-input methods when processing complex anatomical regions, providing a new technique for improving pediatric low-dose CT imaging with potential clinical value.
Sea surface height (SSH) is an important parameter in oceanographic studies and is crucial for understanding oceanic and atmospheric processes. Traditional physical models compress two-dimensional delay-Doppler map (DDM) data into a single scalar value, resulting in the loss of critical information and the need for error modeling and correction. First, utilizing the reflected signals from the BeiDou Navigation Satellite System (BDS) provided by the FY-3E GNSS Radio Occultation Sounder-II (GNOS-II), a novel multimodal deep learning model integrating self-attention mechanisms and residual networks, termed Vision Transformer Residual Network Multimodal Deep Learning (ViTResNetMDL), is proposed to retrieve sea surface height. ViTResNetMDL captures the global spatial features of the effective scattering area in the DDM through the self-attention mechanism of the Transformer module, extracts the local detail features of the original DDM using the residual structure, combines them with auxiliary parameters for multimodal data fusion, and finally inverts the SSH through four fully connected layers. Second, to validate the proposed improvements, the inversion results are corrected by cumulative distribution function (CDF) matching, and the DTU18 global sea surface height validation model, corrected with the DTU23 global ocean tide model, is used as a reference for extensive tests; the results show that the SSH inversions of the ViTResNetMDL model have a root mean square error (RMSE) of 0.74 m, a mean absolute error (MAE) of 0.48 m, and a coefficient of determination (R2) of 1.00. Third, compared with the inversion results based on the derivative polar tracking method, the RMSE, MAE, and R2 of the ViTResNetMDL inversions are improved by 80.9%, 80.1%, and 16.3%, respectively. The ViTResNetMDL model provides a new theoretical and methodological reference for GNSS-R altimetry inversion.
Disaster Response Mapping (DRM) integrates multisource imaging data so that maps of regions affected by calamities can be produced in real time, supporting sufficient and timely decision-making in emergency operations. Multimodal Image Fusion Using Guided Attention (MM-IFGA) is proposed to improve DRM by fusing optical, infrared, and Synthetic Aperture Radar (SAR) images into a single high-quality image. The multimodal inputs are first synchronized and preprocessed to ensure spatial coherence and reduce noise. Spatial, spectral, and structural features of the modalities are extracted with CNNs, and disaster-relevant regions are highlighted with a guided attention mechanism. The attention-enhanced features are then weighted, concatenated, and decoded into a high-resolution output map that retains structural and spectral integrity. The MM-IFGA framework also addresses the heterogeneity, noise, and redundancy of multimodal disaster information, allowing emergency teams to optimize resource allocation, intervention targeting, and recovery planning. Experimental comparisons show that the proposed method achieves higher image quality, with a PSNR of 33.1, better structural similarity, and better overall visual fidelity, thereby providing suitable and timely assessment of large-scale calamities. These findings demonstrate that MM-IFGA is a scalable and sound intelligent real-time approach to disaster mapping for events such as floods, wildfires, and structural damage investigations.
Gait recognition, as a long-distance and non-invasive biometric identification technology, has significant application value in public security and intelligent monitoring. However, viewpoint changes, clothing variations, and occlusion in real-world scenarios severely affect the robustness of traditional silhouette-based methods. Although skeleton-based methods are naturally robust to appearance changes, their recognition performance is limited by the accuracy of pose estimation, and existing methods lack dynamic adaptive capabilities in multimodal fusion. To this end, a multimodal gait recognition method named SkeletonGait-CPCA is proposed. The method first converts human joint coordinates into a structured skeleton heat map to construct the baseline model SkeletonGait; on this basis, a channel prior convolutional attention (CPCA) module is introduced, achieving adaptive fusion of silhouette and skeleton features through parallel channel and spatial attention mechanisms. Experiments on the real-world gait dataset Gait3D show that with the CPCA module, SkeletonGait-CPCA improves the rank-1, rank-5, and mAP metrics to 78.2%, 90.4%, and 72.4%, respectively, verifying the superiority and robustness of the proposed method in complex scenarios.
LiDAR, radar, and cameras are widely used in autonomous driving systems, but each modality has inherent limitations, such as LiDAR's sensitivity to adverse weather, radar's low spatial resolution, and cameras' dependence on lighting conditions. To address these challenges, this study proposes a novel multi-modal fusion framework that integrates these sensors to enhance object detection accuracy and robustness. A cross-modal feature alignment strategy ensures spatial and semantic consistency across sensor data, while an attention-based mechanism dynamically adjusts the contributions of each modality based on their reliability in different scenarios. Experimental results on the KITTI and nuScenes datasets show that the framework achieves a mean Average Precision (mAP) of 89.4% while maintaining real-time efficiency at 36.8 FPS. Compared to single-modality baselines and traditional fusion methods, the proposed framework demonstrates superior detection performance, particularly in scenarios involving occlusion, low-light conditions, and dense traffic. Ablation studies validate the effectiveness of the cross-modal alignment and attention mechanisms, highlighting their critical roles in achieving robust detection. The proposed framework offers a scalable and efficient solution for autonomous driving systems, effectively addressing the limitations of single-modality sensors.
To address the issue of low accuracy in object detection for autonomous driving, we propose an attention-enhanced multimodal fusion three-dimensional object detection method (EA-BEV). The method incorporates a self-attention mechanism in the image processing network, which effectively extracts deep features and mitigates the insufficient image feature extraction caused by blurred semantic information. In the point cloud processing network, we design a high-order convolutional spatial attention mechanism that significantly enhances the network's ability to model and express non-linear deep point cloud features, thereby improving the global descriptive capability of point cloud information. Comparative experiments on the nuScenes dataset show an mAP of 76.2% and an NDS of 74.4%, demonstrating a clear advantage of EA-BEV in 3D object detection precision.
In autonomous driving, trajectory prediction is essential for safe and efficient navigation. While recent methods often rely on high-definition (HD) maps to provide structured environmental priors, such maps are costly to maintain, geographically limited, and unreliable in dynamic or unmapped scenarios. Directly leveraging raw sensor data in Bird's-Eye View (BEV) space offers greater flexibility, but BEV features are dense and unstructured, making agent-centric spatial reasoning challenging and computationally inefficient. To address this, we propose Bird's-Eye View Trajectory Prediction (BEVTraj), a map-free framework that employs deformable attention to adaptively aggregate task-relevant context from sparse locations in dense BEV features. We further introduce a Sparse Goal Candidate Proposal (SGCP) module that predicts a small set of realistic goals, enabling fully end-to-end multimodal forecasting without heuristic post-processing. Extensive experiments show that BEVTraj achieves performance comparable to state-of-the-art HD map-based methods while providing greater robustness and flexibility without relying on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.
Trajectory prediction is a crucial task for autonomous driving, but current models’ reliance on high-definition (HD) maps limits their broader applicability. To cope with this challenge, we propose a novel map-free trajectory prediction method that leverages spatiotemporal attention mechanisms. The method consists of three key stages: 1) we first encode spatial and temporal features separately using spatial and temporal attention mechanisms, 2) we then model spatial and temporal interactions through Crystal Graph Convolutional Networks (CGCN) and Multi-Head Attention (MHA), 3) finally, an adaptive anchor generation technique is introduced to tackle the multimodal trajectory prediction challenge. This self-adaptive technique generates context-specific anchors, enabling accurate prediction of multiple possible future vehicle trajectories. Extensive experiments on the Argoverse1 and V2X-Seq datasets validate the effectiveness of our approach. On the Argoverse1 dataset, our method outperforms CRAT-Pred by 5.8% in minADE and 6.25% in minFDE. On the V2X-Seq dataset, it achieves improvements of 82.6%, 85.1%, and 44.0% in minADE, minFDE, and MR, respectively, compared to the baseline model.
Accurate segmentation of gliomas is crucial for diagnosis, treatment planning, and prognostic assessment. However, existing multimodal MRI segmentation methods are limited by inadequate information fusion, particularly when addressing significant tumor scale variations. To address these challenges, we present VAMF-Net, a V-Net-based architecture comprising three coordinated modules. AMF performs voxel-wise, modality-adaptive fusion via a spatial attention map, enabling the network to assign dynamic weights to each MRI sequence. MFF aggregates multiscale context by employing parallel 3D dilated convolutions and cross-stage feature fusion, effectively handling large-scale variations. ConBlock3D + 3D-CBAM refines representations with channel and spatial attention and residual connections to sharpen boundaries. On the BraTS 2019 test set (with the model trained on BraTS 2020), VAMF-Net outperforms several advanced baselines (mean Dice: 0.910, HD95: 3.03, ET boundary HD95: 1.80), and ablation studies highlight the complementary contributions of the three modules. This study provides an efficient solution for multimodal medical image segmentation, with strong potential for clinical application.
Human action recognition (HAR) systems are foundational for mobile educational technologies, such as gesture-based learning analytics and remote skill acquisition. However, current systems often fail in real-world settings due to visual occlusion and the neglect of the rich contextual information provided by the acoustic modality, particularly in visual-centric datasets such as NTU RGB+D 60 and MSR Daily Activity 3D. By manually producing action-relevant audio streams for these datasets, we propose a multimodal approach that fuses skeleton and audio modalities through a cross-attention mechanism. Our framework processes skeleton data by integrating joints and limbs into an H × W × 31 spatial feature map, which is then fed into a ResNet50 backbone. Log-Mel spectrograms are encoded using a ConvNeXt-T architecture. A cross-attention mechanism is employed to fuse these features, effectively learning inter-modal dependencies. Evaluations demonstrate significant gains: 94.7% on NTU RGB+D X-SUB (up from 90.5% using only skeleton data) and 97.9% on MSR Daily Activity 3D (compared to 89.8%). These results quantitatively establish the critical role of audio in enabling robust, real-time feedback loops that are essential for smart learning environments and interactive mobile coaching, where visual data alone is unreliable.
A multimodal adaptive action recognition method based on attention distillation is proposed to address interference from irrelevant information in multimodal feature extraction and insufficient modal complementarity and consistency in fusion. First, a dual-branch FPN processes RGB and depth-map data in parallel for multiscale feature extraction, while skeletal data undergoes skeleton graph modeling. A feature decoupling module and an attention distillation loss are designed and optimized via backpropagation, enhancing task-related feature extraction and reducing noise-induced redundancy. In fusion, a Transformer uses its Encoder-Decoder structure to adaptively adjust multimodal fusion weights. Finally, a spatiotemporal graph convolutional network performs spatiotemporal modeling and classification on the enhanced skeleton graph features, capturing the spatial correlations of human joints and the temporal dynamics of movement to improve classification accuracy. Experiments on the NTU RGB-D60 and NTU RGB-D120 datasets show the method outperforms benchmark algorithms, verifying its effectiveness in action recognition.
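One common way to realize an attention distillation term, shown here only as an illustration of the mechanism rather than the paper's exact loss, is to match normalized spatial attention maps between a teacher and a student branch with a KL divergence.

```python
# Generic attention distillation: derive a spatial attention map from each
# branch's feature map by channel-wise energy, normalize into distributions,
# and penalize their KL divergence.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_feat, teacher_feat, temperature=1.0):
    # feats: (batch, channels, H, W)
    s_att = student_feat.pow(2).mean(dim=1).flatten(1)     # (batch, H*W)
    t_att = teacher_feat.pow(2).mean(dim=1).flatten(1)
    s_log = F.log_softmax(s_att / temperature, dim=1)
    t_prob = F.softmax(t_att / temperature, dim=1)
    return F.kl_div(s_log, t_prob, reduction="batchmean")

student = torch.randn(4, 64, 14, 14, requires_grad=True)
teacher = torch.randn(4, 64, 14, 14)
loss = attention_distillation_loss(student, teacher)
loss.backward()
```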
Accurate environmental perception is fundamental to safe autonomous driving; however, most existing multimodal systems rely on fixed or heuristic sensor fusion strategies that cannot adapt to scene-dependent variations in sensor reliability. This paper proposes Cross-Modal Adaptive Attention (CMAA), a unified end-to-end Bird’s-Eye-View (BEV) perception framework that dynamically fuses camera, LiDAR, and RADAR information through learnable, context-aware modality gating. Unlike static fusion approaches, CMAA adaptively reweights sensor contributions based on global scene descriptors, enabling the robust integration of semantic, geometric, and motion cues without manual tuning. The proposed architecture jointly performs 3D object detection, multi-object tracking, and motion forecasting within a shared BEV representation, preserving spatial alignment across tasks and supporting efficient real-time deployment. Experiments conducted on the official nuScenes validation split demonstrate that CMAA achieves 0.528 mAP and 0.691 NDS, outperforming fixed-weight fusion baselines while maintaining a compact model size and efficient inference. Additional tracking evaluation using the official nuScenes tracking devkit reports improved tracking performance, while motion forecasting experiments show reduced trajectory displacement errors (minADE and minFDE). Ablation studies further confirm the complementary contributions of adaptive modality gating and bidirectional cross-modal refinement, and a stratified dynamic analysis reveals consistent reductions in velocity estimation error across object classes, motion regimes, and environmental conditions. These results demonstrate that adaptive multimodal fusion improves robustness, motion reasoning, and perception reliability in complex traffic environments while remaining computationally efficient for deployment in safety-critical autonomous driving systems.
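A minimal sketch of context-aware modality gating: a global scene descriptor produces per-modality weights that rescale camera, LiDAR, and RADAR BEV features before fusion. The descriptor, dimensions, and softmax gating are illustrative assumptions, not CMAA's exact design.

```python
# Scene-conditioned modality gating over BEV feature maps: pool each modality
# globally, predict softmax weights from the concatenated descriptor, and
# reweight the modalities before additive fusion.
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    def __init__(self, dim=128, n_modalities=3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * n_modalities, 64), nn.ReLU(),
            nn.Linear(64, n_modalities)
        )

    def forward(self, bev_feats):
        # bev_feats: list of (batch, dim, H, W) maps, one per modality
        desc = torch.cat([f.mean(dim=(2, 3)) for f in bev_feats], dim=1)
        weights = torch.softmax(self.gate(desc), dim=1)          # (batch, n_mod)
        fused = sum(w.view(-1, 1, 1, 1) * f
                    for w, f in zip(weights.unbind(dim=1), bev_feats))
        return fused, weights

cam, lidar, radar = (torch.randn(2, 128, 32, 32) for _ in range(3))
fused, w = ModalityGate()([cam, lidar, radar])
```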
Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning capabilities but remain highly susceptible to object hallucination, where models generate responses that are not factually aligned with the visual content. Recent works attribute this issue to an inherent bias of LVLMs in which the vision-token attention map has a fixed correlation with spatial position, and propose to mitigate it by reordering visual tokens. However, we find that different LVLMs exhibit different correlations between attention and spatial position, which makes the existing solution difficult to generalize to other LVLMs. To address this, we first introduce a training-free solution, Uniform Attention Calibration (UAC), which estimates the bias from a single meaningless input image and applies a calibration matrix to rectify attention imbalances. To further alleviate the bias, we relax UAC's single-meaningless-input assumption and introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), which enforces consistent outputs wherever the object is located in the image via a plug-and-play module. Comprehensive experiments across multiple benchmarks demonstrate that UAC and DAC significantly reduce object hallucination while improving general multimodal alignment. Our methods achieve state-of-the-art performance across diverse LVLM architectures on various metrics.
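The bias-then-calibrate idea can be sketched as follows; the exact calibration matrix used by UAC is not reproduced here, so this per-position rescaling of vision-token attention is only an illustrative assumption.

```python
# Estimate a positional attention bias from a meaningless (e.g., all-grey)
# image, then divide real attention over vision tokens by that profile and
# renormalize, so no position is favored purely by its location.
import torch

def estimate_position_bias(attn_meaningless):
    # attn_meaningless: (layers, heads, n_vision_tokens) attention mass that
    # the meaningless image's tokens receive; average away layers and heads
    return attn_meaningless.mean(dim=(0, 1)) + 1e-8          # (n_vision_tokens,)

def calibrate(attn, bias):
    # attn: (..., n_vision_tokens) attention over vision tokens, real image
    rectified = attn / bias
    return rectified / rectified.sum(dim=-1, keepdim=True)

bias = estimate_position_bias(torch.rand(32, 16, 576))
real_attn = torch.softmax(torch.randn(4, 576), dim=-1)
calibrated = calibrate(real_attn, bias)
```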
The rapid evolution of deep learning has dramatically enhanced the field of medical image segmentation, leading to the development of models with unprecedented accuracy in analyzing complex medical images. Deep learning-based segmentation holds significant promise for advancing clinical care and enhancing the precision of medical interventions. However, these models’ high computational demand and complexity present significant barriers to their application in resource-constrained clinical settings. To address this challenge, we introduce Teach-Former, a novel knowledge distillation (KD) framework that leverages a Transformer backbone to effectively condense the knowledge of multiple teacher models into a single, streamlined student model. Moreover, it excels in the contextual and spatial interpretation of relationships across multimodal images for more accurate and precise segmentation. Teach-Former stands out by harnessing multimodal inputs (CT, PET, MRI) and distilling the final predictions and the intermediate attention maps, ensuring a richer spatial and contextual knowledge transfer. Through this technique, the student model inherits the capacity for fine segmentation while operating with a significantly reduced parameter set and computational footprint. Additionally, introducing a novel training strategy optimizes knowledge transfer, ensuring the student model captures the intricate mapping of features essential for high-fidelity segmentation. The efficacy of Teach-Former has been effectively tested on two extensive multimodal datasets, HECKTOR21 and PI-CAI22, encompassing various image types. The results demonstrate that our KD strategy reduces the model complexity and surpasses existing state-of-the-art methods to achieve superior performance. The findings of this study indicate that the proposed methodology could facilitate efficient segmentation of complex multimodal medical images, supporting clinicians in achieving more precise diagnoses and comprehensive monitoring of pathological conditions (https://github.com/FarihaHossain/TeachFormer).
Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
Accurate road network extraction from remote sensing images (RSIs) is essential for applications such as urban planning, map updates, and autonomous navigation. However, challenges such as complex backgrounds, varying spatial resolutions, and occlusions hinder traditional single-modality approaches, which often fail to capture comprehensive contextual information. To address these limitations, we propose the DEFNet, a novel dual-layer evidential fusion network for robust multimodal road extraction. The DEFNet features two key modules: cross-attention feature interaction (CAFI) and dual-layer evidential fusion (DEF). The CAFI module facilitates adaptive multimodal interaction at both pixel and superpixel levels, enhancing feature fusion while mitigating noise. The DEF module, leveraging the Dirichlet framework and Dempster–Shafer theory (DST), performs uncertainty-aware fusion, improving prediction reliability and robustness. Extensive experiments on multiple benchmark datasets demonstrate that the DEFNet consistently outperforms state-of-the-art methods in both accuracy and robustness, making it highly effective for multimodal road extraction in remote sensing applications. The codes can be downloaded from https://github.com/BeechburgPieStar/DEFNet
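For reference, the reduced Dempster-Shafer combination of two Dirichlet-based opinions commonly used in evidential fusion can be sketched as below; whether DEFNet applies exactly this reduced rule is an assumption, the sketch only illustrates the general mechanism of uncertainty-aware fusion.

```python
# Dempster-Shafer combination of two Dirichlet-based opinions: per-modality
# evidence defines class belief masses plus an uncertainty mass, which are
# combined with Dempster's rule (masses always sum to 1).
import numpy as np

def opinion_from_evidence(evidence):
    # evidence: non-negative per-class evidence, shape (K,)
    K = evidence.shape[0]
    alpha = evidence + 1.0                 # Dirichlet parameters
    S = alpha.sum()
    belief = evidence / S                  # per-class belief masses
    uncertainty = K / S                    # leftover mass on "don't know"
    return belief, uncertainty

def dempster_combine(b1, u1, b2, u2):
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)   # mass on disagreeing pairs
    scale = 1.0 / (1.0 - conflict)
    belief = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    uncertainty = scale * u1 * u2
    return belief, uncertainty

b_opt, u_opt = opinion_from_evidence(np.array([4.0, 1.0]))   # e.g., optical branch
b_sar, u_sar = opinion_from_evidence(np.array([2.0, 2.0]))   # e.g., SAR branch
b, u = dempster_combine(b_opt, u_opt, b_sar, u_sar)
print(b, u, b.sum() + u)                                      # combined masses sum to 1
```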
Multimodal fusion technology significantly enhances the safety and perception capabilities of intelligent vehicles. Recently, replacing Cartesian coordinate system voxels with polar voxels in 3D perception tasks has significantly improved spatial occupancy rates and adaptability. However, the uneven distribution of voxels introduces new challenges: feature information distortion and reduced real-time performance. This paper proposes a multimodal fusion network based on polar graphs to address these issues. Raw data from LiDAR, cameras, and millimeter-wave (MMW) radar are initially preprocessed, and point-graph and voxel-graph structures in polar coordinates are constructed. Subsequently, using Graph Attention Networks (GAT), features are extracted and aggregated at multiple levels, forming a polar-based Bird's Eye View (BEV) feature map. At the BEV level, multimodal features are fused, and multi-scale features are aggregated using multi-scale GAT, culminating in the design of a polar-based CenterHead to complete the 3D perception task. Extensive experiments conducted on the nuScenes dataset and real vehicle test data have demonstrated that the model's detection precision (70.5% mAP) and inference speed (12.6 Hz) surpass those of comparative models, establishing a new state-of-the-art (SOTA). Additionally, the model exhibits high levels of perception accuracy, robustness, and generalizability across various real vehicle scenarios.
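The following small sketch illustrates, under assumed ranges and resolutions, how Cartesian LiDAR points can be assigned to polar (range, azimuth, height) voxel indices; it is a generic illustration of polar voxelization, not this paper's configuration.

```python
# Illustration of assigning Cartesian LiDAR points to polar (range, azimuth, z) voxels.
# Ranges and resolutions are illustrative assumptions, not the paper's configuration.
import numpy as np

def polar_voxel_indices(points, r_max=50.0, n_r=100, n_theta=180,
                        z_min=-3.0, z_max=3.0, n_z=10):
    """points: (N, 3) array of x, y, z coordinates in meters."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)                              # radial distance
    theta = np.arctan2(y, x)                        # azimuth in [-pi, pi)

    r_idx = np.clip((r / r_max * n_r).astype(int), 0, n_r - 1)
    t_idx = np.clip(((theta + np.pi) / (2 * np.pi) * n_theta).astype(int), 0, n_theta - 1)
    z_idx = np.clip(((z - z_min) / (z_max - z_min) * n_z).astype(int), 0, n_z - 1)
    return np.stack([r_idx, t_idx, z_idx], axis=1)

pts = np.random.uniform(-40, 40, size=(1000, 3))
voxels = polar_voxel_indices(pts)
```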
Effectively segmenting brain tumors from multimodal 3D MRI scans presents a formidable challenge due to the heterogeneous nature of tumor structures and the inherent modality imbalance. This study presents the Enhanced Region-Aware Fusion Network, an end-to-end architecture that integrates spatial-probabilistic reasoning with adaptive modality fusion for volumetric medical image segmentation. The Probability Map Estimator is the fundamental module of this study. It generates region confidence maps, guiding an Enhanced Region-Aware Fusion Module to learn dynamic attention weights across various MRI modalities. The fused representations are then refined by a shared decoder to enable precise tumor delineation. The Enhanced Region-Aware Fusion Network simultaneously minimizes both Binary Cross-Entropy and Dice losses, thereby enhancing sensitivity at tumor boundaries and effectively addressing label imbalance. We evaluated our model on the BraTS2020 dataset. Results show that the Enhanced Region-Aware Fusion Network surpasses conventional multimodal fusion approaches while maintaining spatial consistency within tumor regions, highlighting its promise for clinical MRI tumor diagnosis.
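A minimal sketch of the joint Binary Cross-Entropy plus Dice objective described above; the 0.5/0.5 weighting and the single-channel volumetric shapes are assumptions for illustration.

```python
# Sketch of a joint BCE + Dice loss for volumetric binary segmentation.
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, bce_weight=0.5, eps=1e-6):
    """logits, target: (B, 1, D, H, W) volumetric prediction and binary ground truth."""
    bce = F.binary_cross_entropy_with_logits(logits, target)

    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3, 4))
    union = prob.sum(dim=(1, 2, 3, 4)) + target.sum(dim=(1, 2, 3, 4))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)

    return bce_weight * bce + (1 - bce_weight) * dice.mean()
```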
In multi-energy systems, the continuous expansion of sensor networks and the rapid growth of data dimensions make traditional single-modality anomaly detection strategies difficult to adapt to the multi-source heterogeneity and latent correlations found in complex energy consumption scenarios. Building on multimodal fusion, the Correlation-Driven Multi-Level Multimodal model in this work performs cross-modal collaborative learning and multi-level semantic representation over the features of multiple energy sources. First, a multi-channel attention mechanism mines the correlations between energy consumption data in different modalities and weights the key features accordingly. Then, a hierarchical embedding network maps the multimodal features into a high-dimensional unified semantic space, capturing temporal information and spatial relations at different levels. Moreover, by combining graph structure learning with adversarial training, we construct a composite graph that covers the multi-source data and their latent interactions, enhancing the accuracy and robustness of abnormal-feature capture and enabling accurate detection of complex anomalous patterns in multi-energy systems. The experimental results demonstrate that, compared to previous mainstream methods, the proposed model achieves higher detection accuracy and stronger generalization on multi-source energy datasets.
No abstract available
This study addresses the challenges of feature representation and computational complexity in small facial acne detection by proposing a multi-modal knowledge distillation method. An enhanced YOLOv8s-based teacher model is proposed, integrating a transformer architecture into the backbone to improve global feature capture and a novel multi-scale attention mechanism in the neck network to enhance small acne feature representation. The student model is a lightweight version of the teacher model, obtained by replacing standard convolution with depthwise separable convolution. Additionally, a multi-modal distillation approach leveraging spatial, channel, and correlation information is proposed to overcome limitations of single-information distillation, enabling efficient training of student models. Experiments show that the proposed method achieves mAP of 31.52% on ACNE4K and 24.1% on ACNE04, with 7.83M parameters and 61.06G FLOPs, reducing parameters by 45% and computational costs by 50% compared to the teacher model while maintaining comparable accuracy.
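As an illustration of the lightweight-student construction described above, the following PyTorch sketch replaces a standard convolution with a depthwise separable block; the channel sizes and the activation/normalization choices are assumptions, not the paper's exact module.

```python
# Sketch of swapping a standard convolution for a depthwise separable one, the kind of
# substitution used to derive a lightweight student; channel sizes are illustrative.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 64, 80, 80)
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```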
The detection of micro-inclusions and the representation of interpretable results compatible with gemological standards are two major bottlenecks to the automation of diamond clarity grading. In this work, we propose a novel solution using an enhanced YOLOv7 model in a multimodal framework to overcome the above bottlenecks. Specifically, our contributions are threefold: Firstly, we improve the original YOLOv7 by incorporating the Efficient Channel Attention (ECA) mechanism to enhance the extracted fine-grained features and the Adaptively Spatial Feature Fusion (ASFF) module for capturing more robust multi-scale representations; secondly, we build a three-channel input consisting of optical grayscale, gray-level co-occurrence matrix (GLCM) texture linked with the optical properties of inclusions, and morphological operation-enhanced images; thirdly, we design a traceable grading system by combining the XGBoost classifier with programmable GIA (Gemological Institute of America) rules. Our method achieves 91.3% mAP@0.5 on the Roboflow Diamond Inclusion dataset, outperforming the baseline YOLOv7 by 9.2%. The clarity grading performance on this dataset attains an accuracy of 86.7%, a Kappa coefficient of 0.82, and a weighted F1-score of 0.87, resulting in high consistency with human expert evaluations. Ablation experiments confirm that the proposed components all make individual and complementary contributions. This work represents a significant advance towards the automatic, accurate, and interpretable grading of diamonds, and it creates a practical tool for use within the jewelry industry.
Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose Head Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy for stronger spatial concentration and gradient sensitivity for predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage subsequently provided to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and providing a simple yet effective strategy for enhancing precision in MLLMs.
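A minimal sketch of the two ingredients named above that are easiest to make concrete: ranking heads by spatial entropy and cropping around the peak of the fused guidance map. The head count, crop fraction, and the simple top-k fusion are assumptions; HAVC's actual head selection and gradient-sensitivity scoring are more involved.

```python
# Sketch: rank attention heads by spatial entropy, fuse the most focused ones, and
# derive a crop box around the peak of the resulting guidance map.
import torch

def spatial_entropy(attn_map, eps=1e-8):
    """attn_map: (H, W) non-negative attention over image patches; lower = more focused."""
    p = attn_map.flatten()
    p = p / (p.sum() + eps)
    return -(p * (p + eps).log()).sum()

def crop_box_from_heads(head_maps, top_k=4, crop_frac=0.4):
    """head_maps: (num_heads, H, W). Returns (row, col, height, width) in patch coords."""
    entropies = torch.stack([spatial_entropy(m) for m in head_maps])
    keep = entropies.argsort()[:top_k]            # lowest-entropy (most concentrated) heads
    guidance = head_maps[keep].mean(dim=0)

    h, w = guidance.shape
    peak = guidance.flatten().argmax()
    r, c = divmod(peak.item(), w)
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    r0 = max(0, min(r - ch // 2, h - ch))
    c0 = max(0, min(c - cw // 2, w - cw))
    return r0, c0, ch, cw

box = crop_box_from_heads(torch.rand(16, 24, 24))
```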
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frames-events with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity's image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.
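One way to read the Attribute Isolation Attention Mask is as a block mask over image self-attention in which tokens of each entity attend only to tokens of the same entity. The sketch below builds such a mask from a per-token entity assignment; the token-to-entity mapping and the handling of background tokens are assumptions, not Seg2Any's exact implementation.

```python
# Sketch of an attribute-isolation mask: tokens of each entity attend only to the same
# entity during image self-attention; background tokens (-1) are left unrestricted here.
import torch

def attribute_isolation_mask(entity_ids):
    """entity_ids: (N,) entity index per image token, -1 for background.
    Returns an (N, N) boolean mask where True means attention is allowed."""
    same_entity = entity_ids.unsqueeze(0) == entity_ids.unsqueeze(1)
    background = (entity_ids == -1)
    unrestricted = background.unsqueeze(0) | background.unsqueeze(1)
    return same_entity | unrestricted

ids = torch.tensor([0, 0, 1, 1, -1, 2])
mask = attribute_isolation_mask(ids)   # pass as a boolean attn mask, or convert to -inf bias
```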
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
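A minimal sketch of the alignment regularizer idea: a cosine-similarity loss between the MLLM's hidden states at visual token positions and frozen vision foundation model features, through a trainable projection. The layer choice, projection, and dimensions are assumptions for illustration, not VIRAL's exact setup.

```python
# Sketch of a visual representation alignment regularizer between MLLM visual token
# states and frozen VFM patch features.
import torch
import torch.nn.functional as F

def visual_alignment_loss(mllm_visual_states, vfm_features, proj):
    """mllm_visual_states: (B, N, d_llm) hidden states at the visual token positions.
    vfm_features:        (B, N, d_vfm) patch features from a frozen vision foundation model.
    proj: nn.Linear(d_llm, d_vfm) trainable projection."""
    pred = F.normalize(proj(mllm_visual_states), dim=-1)
    target = F.normalize(vfm_features.detach(), dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()

proj = torch.nn.Linear(4096, 1024)
loss = visual_alignment_loss(torch.randn(2, 196, 4096), torch.randn(2, 196, 1024), proj)
```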
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
Recent Large Multimodal Models have demonstrated remarkable reasoning capabilities, especially in solving complex mathematical problems and realizing accurate spatial perception. Our key insight is that these emerging abilities can naturally extend to robotic manipulation by enabling LMMs to directly infer the next goal in language via reasoning, rather than relying on a separate action head. However, this paradigm meets two main challenges: i) How to make LMMs understand the spatial action space, and ii) How to fully exploit the reasoning capacity of LMMs in solving these tasks. To tackle the former challenge, we propose a novel task formulation, which inputs the current states of object parts and the gripper, and reformulates rotation by a new axis representation instead of traditional Euler angles. This representation is more compatible with spatial reasoning and easier to interpret within a unified language space. For the latter challenge, we design a pipeline to utilize cutting-edge LMMs to generate a small but high-quality reasoning dataset of multi-round dialogues that successfully solve manipulation tasks for supervised fine-tuning. Then, we perform reinforcement learning by trial-and-error interactions in simulation to further enhance the model's reasoning abilities for robotic manipulation. Our resulting reasoning model built upon a 7B backbone, named ReasonManip, demonstrates three notable advantages driven by its system-2 level reasoning capabilities: i) exceptional generalizability to out-of-distribution environments, objects, and tasks; ii) inherent sim-to-real transfer ability enabled by the unified language representation shared across domains; iii) transparent interpretability connecting high-level reasoning and low-level control. Extensive experiments demonstrate the effectiveness of the proposed paradigm and its potential to advance LMM-driven robotic manipulation.
Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a psychological assessment. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks. Codes are released at https://github.com/YanbeiJiang/PICK.
Reasoning Segmentation (RS) is a multimodal vision-text task that requires segmenting objects based on implicit text queries, demanding both precise visual perception and vision-text reasoning capabilities. Current RS approaches rely on fine-tuning vision-language models (VLMs) for both perception and reasoning, but their tokenization of images fundamentally disrupts continuous spatial relationships between objects. We introduce DTwinSeger, a novel RS approach that leverages Digital Twin (DT) representation as an intermediate layer to decouple perception from reasoning. Innovatively, DTwinSeger reformulates RS as a two-stage process, where the first stage transforms the image into a structured DT representation that preserves spatial relationships and semantic properties, and the second employs a Large Language Model (LLM) to perform explicit reasoning over this representation to identify target objects. We propose a supervised fine-tuning method specifically for LLM with DT representation, together with a corresponding fine-tuning dataset Seg-DT, to enhance the LLM's reasoning capabilities with DT representations. Experiments show that our method can achieve state-of-the-art performance on two image RS benchmarks and three image referring segmentation benchmarks. These results indicate that the DT representation functions as an effective bridge between vision and text, enabling complex multimodal reasoning tasks to be accomplished solely with an LLM.
Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates vision-language foundation models, leveraging VideoMAE for dynamic visual encoding and BERT for contextual text representation, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance based on action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the effectiveness of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.
Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
Accurate prediction of communication link quality metrics is essential for vehicle-to-infrastructure (V2I) systems, enabling smooth handovers, efficient beam management, and reliable low-latency communication. The increasing availability of sensor data from modern vehicles motivates the use of multimodal large language models (MLLMs) because of their adaptability across tasks and reasoning capabilities. However, MLLMs inherently lack three-dimensional spatial understanding. To overcome this limitation, a lightweight, plug-and-play bird's-eye view (BEV) injection connector is proposed. In this framework, a BEV of the environment is constructed by collecting sensing data from neighboring vehicles. This BEV representation is then fused with the ego vehicle's input to provide spatial context for the large language model. To support realistic multimodal learning, a co-simulation environment combining CARLA simulator and MATLAB-based ray tracing is developed to generate RGB, LiDAR, GPS, and wireless signal data across varied scenarios. Instructions and ground-truth responses are programmatically extracted from the ray-tracing outputs. Extensive experiments are conducted across three V2I link prediction tasks: line-of-sight (LoS) versus non-line-of-sight (NLoS) classification, link availability, and blockage prediction. Simulation results show that the proposed BEV injection framework consistently improved performance across all tasks. The results indicate that, compared to an ego-only baseline, the proposed approach improves the macro-average of the accuracy metrics by up to 13.9%. The results also show that this performance gain increases by up to 32.7% under challenging rainy and nighttime conditions, confirming the robustness of the framework in adverse settings.
Graph-structured combinatorial challenges are inherently difficult due to their nonlinear and intricate nature, often rendering traditional computational methods ineffective or expensive. However, these challenges can be more naturally tackled by humans through visual representations that harness our innate ability for spatial reasoning. In this study, we propose transforming graphs into images to preserve their higher-order structural features accurately, revolutionizing the representation used in solving graph-structured combinatorial tasks. This approach allows machines to emulate human-like processing in addressing complex combinatorial challenges. By combining the innovative paradigm powered by multimodal large language models (MLLMs) with simple search techniques, we aim to develop a novel and effective framework for tackling such problems. Our investigation into MLLMs spanned a variety of graph-based tasks, from combinatorial problems like influence maximization to sequential decision-making in network dismantling, as well as addressing six fundamental graph-related issues. Our findings demonstrate that MLLMs exhibit exceptional spatial intelligence and a distinctive capability for handling these problems, significantly advancing the potential for machines to comprehend and analyze graph-structured data with a depth and intuition akin to human cognition. These results also imply that integrating MLLMs with simple optimization strategies could form a novel and efficient approach for navigating graph-structured combinatorial challenges without complex derivations, computationally demanding training and fine-tuning.
Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
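A minimal PyTorch sketch of the two-stage positional injection idea described above: modality-specific positional encodings added within each stream, then a shared global positional encoding added to the concatenated token sequence immediately before attention. The token counts, dimensions, and the single attention layer are illustrative assumptions rather than the ViTaPEs architecture.

```python
# Sketch of two-stage positional injection for visuotactile token fusion.
import torch
import torch.nn as nn

class TwoStagePositionalFusion(nn.Module):
    def __init__(self, dim=256, n_vis=196, n_tac=64, heads=8):
        super().__init__()
        self.pos_vis = nn.Parameter(torch.zeros(1, n_vis, dim))    # local: vision stream
        self.pos_tac = nn.Parameter(torch.zeros(1, n_tac, dim))    # local: tactile stream
        self.pos_global = nn.Parameter(torch.zeros(1, n_vis + n_tac, dim))  # shared vocabulary
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens, tac_tokens):
        vis_tokens = vis_tokens + self.pos_vis          # stage 1: modality-specific PEs
        tac_tokens = tac_tokens + self.pos_tac
        tokens = torch.cat([vis_tokens, tac_tokens], dim=1) + self.pos_global  # stage 2
        fused, _ = self.attn(tokens, tokens, tokens)    # cross-modal interaction
        return fused

out = TwoStagePositionalFusion()(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
```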
Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
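A toy illustration of the grid-plus-offset idea: a grid token picks a coarse spatial anchor cell and offset tokens refine the coordinate within it. The vocabulary layout and value ranges below are invented for illustration, not GETok's actual token scheme.

```python
# Toy decoding of a grid token plus offsets into continuous normalized coordinates.
def decode_grid_offset(grid_id, dx, dy, grid_size=32):
    """grid_id in [0, grid_size**2): index of the anchor cell (row-major).
    dx, dy in [-0.5, 0.5): normalized offsets within one cell.
    Returns (x, y) in [0, 1] normalized image coordinates."""
    row, col = divmod(grid_id, grid_size)
    cell = 1.0 / grid_size
    x = (col + 0.5 + dx) * cell
    y = (row + 0.5 + dy) * cell
    return x, y

# Anchor cell 530 with a small refinement toward the upper-left of the cell.
print(decode_grid_offset(530, dx=-0.2, dy=-0.1))
```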
Effective machine learning for natural hazard prediction and monitoring depends on timely access to high-quality, event-specific datasets and models capable of adapting to evolving environmental dynamics (e.g., those induced by climate change). Equally important is model explainability, which enhances trust by clarifying decision-making processes and enabling insight into observed hazard patterns. This article introduces a novel approach for the automated construction of multimodal hazard datasets tailored for supervised learning. Central to our method is an ontology-driven self-labeling pipeline that semantically annotates each data element using concepts from a modular, integrated ontology encompassing geographic, hazard, sensor, spatial, and temporal dimensions. This enriched semantic representation facilitates the rapid generation of event-specific datasets and supports reuse across hazard types. Furthermore, embedding ontological descriptors into machine learning outputs enables explainable AI through semantic reasoning, enhancing the interpretability and transparency of predictions. Our pipeline allows for dynamic dataset creation, model adaptation to newly emerging patterns, and live semantic querying over a knowledge graph. Each dataset instance encapsulates a rich semantic narrative including hazard type, evolution stage, and contextual variables such as land cover, socio-environmental indicators, and historical weather data.
The efficient management of campus infrastructure presents a complex spatiotemporal forecasting challenge characterized by dynamic interdependencies between physical assets. Traditional models fail to capture these intricate relationships as they treat buildings as independent entities or rely on static correlation structures. This paper introduces a novel Spatiotemporal Graph Neural Network (ST-GNN) framework that reframes infrastructure forecasting as a relational reasoning task, enabling dynamic inference of campus-wide interdependencies. Our approach integrates Graph Attention Networks (GAT) to learn time-varying spatial dependencies and Gated Temporal Convolutional Networks (TCNs) to capture multi-scale temporal patterns. A key innovation is our context-sensitive graph construction method that incorporates physical proximity, functional similarity, and human mobility data to create a holistic representation of campus dynamics. Evaluated on a real-world multimodal dataset comprising 24 months of energy and occupancy data from 50 campus buildings, the proposed model demonstrates superior performance, achieving a 16.3% reduction in mean absolute error compared to the strongest baseline. Comprehensive ablation studies confirm the critical contribution of each architectural component, while qualitative analysis reveals the model’s capacity to provide interpretable insights into campus operational patterns. This work provides a powerful framework for intelligent campus management, enabling precise resource allocation, energy optimization, and sustainable operational planning through advanced relational reasoning capabilities.
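A small sketch of the context-sensitive graph construction named above: an adjacency matrix mixed from physical proximity, functional similarity, and mobility co-occurrence. The weights and the row normalization are assumptions; the paper's construction may differ.

```python
# Sketch of building a mixed adjacency from proximity, functional similarity, and mobility.
import numpy as np

def build_adjacency(proximity, func_sim, mobility, weights=(0.4, 0.3, 0.3)):
    """All inputs: (N, N) matrices scaled to [0, 1] for N buildings."""
    w_p, w_f, w_m = weights
    adj = w_p * proximity + w_f * func_sim + w_m * mobility
    np.fill_diagonal(adj, 0.0)                  # no self-loops before GAT normalization
    deg = adj.sum(axis=1, keepdims=True).clip(min=1e-8)
    return adj / deg                            # row-normalized adjacency

n = 50
A = build_adjacency(np.random.rand(n, n), np.random.rand(n, n), np.random.rand(n, n))
```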
Existing multimodal document question-answering (QA) systems predominantly rely on flat semantic retrieval, representing documents as a set of disconnected text chunks and largely neglecting their intrinsic hierarchical and relational structures. Such flattening disrupts logical and spatial dependencies, such as section organization, figure-text correspondence, and cross-reference relations, that humans naturally exploit for comprehension. To address this limitation, we introduce a document-level structural Document MAP (DMAP), which explicitly encodes both hierarchical organization and inter-element relationships within multimodal documents. Specifically, we design a Structured-Semantic Understanding Agent to construct DMAP by organizing textual content together with figures, tables, charts, etc. into a human-aligned hierarchical schema that captures both semantic and layout dependencies. Building upon this representation, a Reflective Reasoning Agent performs structure-aware and evidence-driven reasoning, dynamically assessing the sufficiency of retrieved context and iteratively refining answers through targeted interactions with DMAP. Extensive experiments on MMDocQA benchmarks demonstrate that DMAP yields document-specific structural representations aligned with human interpretive patterns, substantially enhancing retrieval precision, reasoning consistency, and multimodal comprehension over conventional RAG-based approaches. Code is available at https://github.com/Forlorin/DMAP
In existing UAV swarm task planning methods, task allocation and path planning are usually solved in two separate modules. Although this reduces the complexity of problem solving, it ignores the coupling between task allocation and path optimization in actual tasks, which reduces collaborative efficiency in UAV swarms. To address the above problems, this paper proposes a novel UAV swarm decision-making system under a unified task and spatial view. In the task part, the system uses a large language model (LLM) for high-level reasoning and Chain of Thought (CoT) task decomposition in the high-level task allocation decision module. The LLM planner iteratively decomposes the task into subtasks and dynamically optimizes the reward function of the agent as conditions change. At the same time, the multimodal representation learning module based on Retrieval Augmented Generation (RAG) knowledge integration fuses heterogeneous inputs (such as vision, radar, depth map) into a compact latent state, while the RAG-based module continuously retrieves relevant past experiences from the UAV knowledge base, and the retrieved contextual information is combined with real-time sensor data to inform the current UAV decision. These components are tightly coupled with a closed-loop reinforcement learning (RL) agent for the spatial part, where real-time feedback from the LLM and RAG modules adaptively shapes the reward signals. Experiments on challenging cooperative tasks demonstrate that our system significantly outperforms traditional RL baselines in coordination efficiency and adaptability.
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model's intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g. GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.
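To give a sense of what a grid-based scene schema could look like as a plain data structure (object layouts on a coarse grid plus explicit relations and priors), here is a small illustrative example; the field names and values are invented, and the actual SIG format may differ.

```python
# Illustrative grid-based scene schema: object layouts, relations, and priors.
# Field names are invented for illustration, not the actual SIG specification.
scene_grid = {
    "grid_size": [8, 8],                      # rows x cols over the image / BEV plane
    "objects": [
        {"id": 0, "label": "car",           "cell": [5, 2], "heading": "east"},
        {"id": 1, "label": "pedestrian",    "cell": [4, 3]},
        {"id": 2, "label": "traffic_light", "cell": [1, 4], "state": "red"},
    ],
    "relations": [
        {"subject": 1, "predicate": "in_front_of", "object": 0},
        {"subject": 0, "predicate": "approaching", "object": 2},
    ],
    "priors": ["vehicles stay in drivable cells", "pedestrians move slower than vehicles"],
}
```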
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
Current foundation models (FMs) rely on token representations that directly fragment continuous real-world multimodal data into discrete tokens. They limit FMs to learning real-world knowledge and relationships purely through statistical correlation rather than leveraging explicit domain knowledge. Consequently, current FMs struggle with maintaining semantic coherence across modalities, capturing fine-grained spatial-temporal dynamics, and performing causal reasoning. These limitations cannot be overcome by simply scaling up model size or expanding datasets. This position paper argues that the machine learning community should consider digital twin (DT) representations, which are outcome-driven digital representations that serve as building blocks for creating virtual replicas of physical processes, as an alternative to the token representation for building FMs. Finally, we discuss how DT representations can address these challenges by providing physically grounded representations that explicitly encode domain knowledge and preserve the continuous nature of real-world processes.
The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://github.com/docworlds/tsg-bench.
In this paper, a novel benchmark for audio-visual question answering continual learning (AVQACL) is introduced, aiming to study fine-grained scene understanding and spatial-temporal reasoning in videos under a continual learning setting. To facilitate this multimodal continual learning task, we create two audio-visual question answering continual learning datasets, named Split-AVQA and Split-MUSIC-AVQA, based on the AVQA and MUSIC-AVQA datasets, respectively. The experimental results suggest that the model exhibits limited cognitive and reasoning abilities and experiences catastrophic forgetting when processing three modalities simultaneously in a continuous data stream. To address the above challenges, we propose a novel continual learning method that incorporates question-guided cross-modal information fusion (QCIF) to focus on question-relevant details for improved feature representation and task-specific knowledge distillation with spatial-temporal feature constraints (TKD-STFC) to preserve the spatial-temporal reasoning knowledge acquired from previous dynamic scenarios. Furthermore, a question semantic consistency constraint (QSCC) is employed to ensure that the model maintains a consistent understanding of question semantics across tasks throughout the continual learning process. Extensive experimental results on the Split-AVQA and Split-MUSIC-AVQA datasets illustrate that our method achieves state-of-the-art audio-visual question answering continual learning performance. The code is available at https://github.com/kx-wu/AVQACL.
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, but their remote sensing (RS) counterparts remain relatively underexplored. Unlike natural images, RS imagery presents unique challenges that current MLLMs struggle to handle, particularly in visual grounding and spatial reasoning. This study investigates the limitations of CLIP-based MLLMs in RS, highlighting their failure to differentiate visually distinct yet semantically similar RS images. To address this, we introduce a remote sensing multimodal visual patterns (RSMMVP) benchmark. It is designed to evaluate MLLMs in RS tasks by identifying the CLIP-blind pairs, where CLIP-based models incorrectly assign high similarity scores to visually distinct RS images. Through a visual question answering (VQA) evaluation, we analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS specific representation learning. The results provide valuable insights into the weaknesses of CLIP-based visual encoding and offer a foundation for future research to develop more effective MLLMs tailored for remote sensing applications.
Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.
The rapid development of Vision-Language models (VLMs) and Multimodal Language Models (MLLMs) in autonomous driving research has significantly reshaped the landscape by enabling richer scene understanding, context-aware reasoning, and more interpretable decision-making. However, much existing work relies on either single-view encoders that fail to exploit the spatial structure of multi-camera systems or operates on aggregated multi-view features, which lack a unified spatial representation, making it more challenging to reason about ego-centric directions, object relations, and the wider context. We thus present BeLLA, an end-to-end architecture that connects unified 360° BEV representations with a large language model for question answering in autonomous driving. We primarily evaluate our work using two benchmarks - NuScenes-QA and DriveLM, where BeLLA consistently outperforms existing approaches on questions that require greater spatial reasoning, such as those involving relative object positioning and behavioral understanding of nearby objects, achieving up to +9.3% absolute improvement in certain tasks. In other categories, BeLLA performs competitively, demonstrating the capability of handling a diverse range of questions.
To address the issues of robots lacking spatial semantics and having inaccurate instruction parsing during task planning in open environments, this paper proposes a two-stage composite task planner called DQTP (Dual-Qwen Tandem Planner). This planner adopts two Qwen2-VL multimodal large language models working in tandem: the first model extracts the spatial relationships of task-related objects from scene images or videos through a standardized prompt template and outputs them in the form of natural language prompts as spatial semantic priors; the second model takes composite task instructions, real-time scene videos, and spatial prompts as multimodal inputs to complete task decomposition and generate executable action sequences. To enhance the capabilities of spatial relationship representation and task reasoning, this paper designs a lightweight fine-tuning strategy and constructs a standardized prompt template to obtain high-quality training samples. Experimental results show that DQTP improves the completeness of spatial relationship extraction by 7.6% compared with the single model. In typical home task planning, both visual consistency and physical feasibility are increased by about 12%, and the execution success rate reaches 69.1%. As a result of closed-loop feedback, the execution success rate reaches 87.3%.
Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model’s capacity for generalization and higher-level reasoning. In this paper, we propose a fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. Generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that assists task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase the adaptability of downstream tasks. Experiments on SUTD-TrafficQA demonstrate that our FIQ achieves state-of-the-art performance compared to existing baseline methods. Code is available at https://github.com/juyoungohjulie/FIQ
The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE -- all the 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only in simplified subtasks do some models outperform the random baseline, indicating that complex reasoning is still a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck, where MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many, multimodal reasoning steps.
While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. Project is open at https://github.com/Cratileo/D2R.
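A small sketch of the "visual draft" idea in its simplest form: overlaying a candidate path on a maze image before handing it back to the model. The grid-to-pixel mapping, colors, and path format are illustrative assumptions, not the D2R pipeline.

```python
# Sketch: overlay a candidate path ("visual draft") on a maze image with PIL.
from PIL import Image, ImageDraw

def overlay_draft(image, path_cells, cell_px=32, color=(255, 0, 0), width=4):
    """image: PIL.Image of the maze. path_cells: list of (row, col) grid cells to visit."""
    canvas = image.copy()
    draw = ImageDraw.Draw(canvas)
    points = [(c * cell_px + cell_px // 2, r * cell_px + cell_px // 2) for r, c in path_cells]
    draw.line(points, fill=color, width=width)
    return canvas

maze = Image.new("RGB", (8 * 32, 8 * 32), "white")
draft = overlay_draft(maze, [(0, 0), (0, 3), (4, 3), (4, 7)])
```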
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our new proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on Spatial457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code is released at https://github.com/XingruiWang/Spatial457.
No abstract available
The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: https://huggingface.co/datasets/UUUserna/OSR-Bench
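The two-stage evaluation described above relies on rotation-invariant matching of cognitive maps; the exact procedure is not given in the abstract, so the following sketch assumes the maps are discretized object-label grids and scores agreement under the four 90-degree rotations (an illustrative simplification, not OSR-Bench's metric).

```python
import numpy as np

def rotation_invariant_match(pred_map: np.ndarray, gt_map: np.ndarray) -> float:
    """Score two H x W cognitive maps (integer object-class labels, 0 = empty)
    by cell-wise agreement, taking the best of the four 90-degree rotations so
    that an arbitrary choice of panorama starting direction is not penalized.
    Illustrative simplification, not OSR-Bench's exact metric."""
    scores = []
    for k in range(4):
        rotated = np.rot90(pred_map, k)
        scores.append(float((rotated == gt_map).mean()))
    return max(scores)

pred = np.array([[0, 1], [2, 0]])
gt = np.array([[2, 0], [0, 1]])
print(rotation_invariant_match(pred, gt))  # 1.0 after a 90-degree clockwise rotation
```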
Geometry is a fundamental branch of mathematics and plays a crucial role in evaluating the reasoning capabilities of multimodal large language models (MLLMs). However, existing multimodal mathematics benchmarks mainly focus on plane geometry and largely ignore solid geometry, which requires spatial reasoning and is more challenging than plane geometry. To address this critical gap, we introduce SolidGeo, the first large-scale benchmark specifically designed to evaluate the performance of MLLMs on mathematical reasoning tasks in solid geometry. SolidGeo consists of 3,113 real-world K-12 and competition-level problems, each paired with visual context and annotated with difficulty levels and fine-grained solid geometry categories. Our benchmark covers a wide range of 3D reasoning subjects such as projection, unfolding, spatial measurement, and spatial vector, offering a rigorous testbed for assessing solid geometry. Through extensive experiments, we observe that MLLMs encounter substantial challenges in solid geometry math tasks, with a considerable performance gap relative to human capabilities on SolidGeo. Moreover, we analyze the performance, inference efficiency and error patterns of various models, offering insights into the solid geometric mathematical reasoning capabilities of MLLMs. We hope SolidGeo serves as a catalyst for advancing MLLMs toward deeper geometric reasoning and spatial intelligence.
No abstract available
Multimodal remote sensing object detection (MM-RSOD) holds great promise for around-the-clock applications. However, it faces challenges in effectively extracting complementary features due to the modality inconsistency and redundancy. Inconsistency can lead to semantic-spatial misalignment, while redundancy introduces uncertainty that is specific to each modality. To overcome these challenges and enhance complementarity exploration and exploitation, this article proposes a dual-dynamic cross-modal interaction network (DDCINet), a novel framework comprising two key modules: a dual-dynamic cross-modal interaction (DDCI) module and a dynamic feature fusion (DFF) module. The DDCI module simultaneously addresses both modality inconsistency and redundancy by employing a collaborative design of channel-gated spatial cross-attention (CSCA) and cross-modal dynamic filters (CMDFs) on evenly segmented multimodal features. The CSCA component enhances the semantic-spatial correlation between modalities by identifying the most relevant channel-spatial features through cross-attention, addressing modality inconsistency. In parallel, the CMDF component achieves cross-modal context interaction through static convolution and further generates dynamic spatial-variant kernels to filter out irrelevant information between modalities, addressing modality redundancy. Following the improved feature extraction, the DFF module dynamically adjusts interchannel dependencies guided by modal-specific global context to fuse features, achieving better complementarity exploitation. Extensive experiments conducted on three MM-RSOD datasets confirm the superiority and generalizability of the DDCINet framework. Notably, our DDCINet, based on the RoI Transformer benchmark and ResNet50 backbone, achieves 78.4% mAP50 on the DroneVehicle test set and outperforms state-of-the-art (SOTA) methods by large margins.
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts, and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2, with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: https://github.com/YU-deep/CRISP_SAM2.git.
Hyperspectral–multispectral image fusion (HMIF) aims to achieve hyperspectral image (HSI) super-resolution by integrating the rich spectral information of HSI with the high spatial resolution of multispectral images (MSIs). Despite remarkable progress enabled by deep learning, HMIF remains challenging. Conventional fusion networks that rely solely on feature concatenation often fail to leverage the abundant prior knowledge inherent in remote sensing data, thereby limiting their ability to simulate the complex nonlinear relationships found in real-world scenes. Moreover, introducing shallow cross-modal feature sharing frequently results in edge artifacts or spectral distortions, while adopting decoupled branches hinders propagating complementary information across modalities. To address these limitations, we propose the spatial–spectral cross-modal alternating direction method of multipliers (ADMM) unfolding network (SCIAU-Net), an explainable deep learning framework that unfolds the optimization process of the ADMM. SCIAU-Net reformulates two degradation models, dominated by HSI and MSI respectively, into a dual-branch neural architecture with dedicated modules designed to solve the corresponding variables. To begin with, dense VRWKV blocks (DVBs) replace handcrafted components, embedding domain knowledge and physical priors of remote sensing images directly into the network. Moreover, we introduce spatial–spectral cross-modal interaction modules. In the HSI-dominated branch, SpeCIM injects MSI-guided spatial cues via adaptive implicit neural representation to extract spatial details, while in the MSI-dominated branch, SpaCIM employs state space duality to model intergroup spectral dependencies and refine spectral reconstruction. Finally, a principled loss function, comprising a mean squared error term and a Karush–Kuhn–Tucker consistency term, penalizes the ADMM primal and dual residuals, promoting convergence toward physically consistent solutions. Extensive qualitative and quantitative experiments on five datasets demonstrate that SCIAU-Net achieves state-of-the-art performance in all evaluated scenarios, producing high-resolution HSI with superior spatial and spectral fidelity.
Substantial advancements have been made in the field of salient object detection in image processing in recent years. This research introduces the three-decoder cross-modal interaction network (TCINet) for salient object detection in unregistered red-green-blue (RGB)–thermal image pairs, modeling information from different modal perspectives. TCINet employs a three-decoder framework to process RGB, thermal, and fused feature maps concurrently. To ensure robust integration between the modalities, mitigating the impact of unregistered images and addressing modality imbalances, we introduce the fusion complementary registration (FCR) module. This module guides attention to connect the two modalities and uses atrous spatial pyramid pooling (ASPP) to adapt to image scale changes. To fully utilize the differences between modalities, we designed two distinct decoders: fusion feature decoder (FFD) for decoding the fused features and single-modal decoder (SMD) for decoding single-modal features. Additionally, we incorporated feature enhancement (FE) units into the modal decoding to mitigate the blurring effect caused by high-speed autonomous aerial vehicle (AAV) flight. We use a weighted fusion module (WFM) to dynamically integrate the features decoded by the three decoders to increase the network’s generalization ability. Extensive experiments show that TCINet outperforms existing methods, achieving excellent results on a variety of challenging scenarios containing complex details. The code will be published at https://github.com/zqiuqiu235/TCINet.git.
3D object detection is crucial for autonomous driving, enabling accurate object classification and localization in the real world. Existing methods typically rely on basic element-wise operations to fuse multi-modal features from point clouds and images, limiting the effective learning of camera semantics and LiDAR spatial information. Additionally, the inherent sparsity of point clouds leads to distribution imbalances in receptive fields, and the complexity of 3D objects conceals implicit relational contexts. To address these limitations, we propose CIDRA-Net, a cross-modal interaction fusion network with distribution-relation awareness. First, we introduce a region cross-modal interaction fusion (RCIF) module that combines LiDAR features with camera depth information through dual-modal attention. We then separate and enhance two distribution-level features using a dual-branch distribution perception (DBDP) module to learn point distributions. Additionally, a global-local relation mining (GLRM) strategy is employed to capture both local and global contextual information for better object understanding and refined regression tasks. Our approach achieves state-of-the-art performance on the nuScenes and KITTI benchmarks while demonstrating strong generalization across backbones and robustness against sensor errors.
3D object detection plays a critical role in autonomous driving perception systems. While existing multimodal approaches typically employ independent feature processing streams followed by direct Bird’s Eye View projection for modality fusion, they encounter three critical limitations: insufficient cross-modal complementarity, feature misalignment across modalities, and inefficient computational workflows. To address these challenges, this paper proposes DCI-PRNet, a dual cross-modal interaction and reasoning framework that establishes deep synergistic relationships between 3D LiDAR point clouds and 2D multi-view images. The core innovations of DCI-PRNet lie in its dual cross-modal interaction module and multi-level progressive reasoning module. The dual cross-modal interaction module enables iterative feature refinement through alternating attention mechanisms and residual feature updating, effectively aligning spatial-semantic representations between point clouds and images. The multi-level progressive reasoning module implements detection refinement through cascaded decoder layers, where each stage progressively enhances detection confidence and localization precision via cross-modal aggregation. Experiments on the nuScenes dataset demonstrate significant performance improvements over conventional methods, achieving 71.7% mAP and 74.2% NDS.
This study introduces a novel cross-modal spatial-spectral interaction Mamba (CMS2I-Mamba) for remote sensing image fusion classification. Unlike convolution-based models that focus on local details and Transformer-based models with high computational complexity, CMS2I-Mamba efficiently models global long-range dependencies with linear complexity. First, multispectral (MS) and panchromatic (PAN) images each have unique advantages in spectral and spatial attributes. Given this, this article designs the multipath selective-scan mechanism (MPS2M), which applies different path scanning strategies to deeply capture global features from both the spectral and spatial dimensions, aiming to enhance the robustness and complementarity of spatial-spectral features. Second, to overcome the characterization differences between images acquired by different sensors, this article further introduces the channel interaction alignment module (CIAM). This module employs efficient former-last and odd-even channel interaction strategies to achieve precise semantic alignment of deep features between modalities. Finally, to leverage the shared fusion features to guide the unique singular features, this article proposes a semantic-aware calibration module (SACM), which accurately constrains and calibrates the same semantic information in deep features. This not only enhances the model's ability to understand scene semantics but also promotes the deep fusion and utilization of information between different modalities. Through experimental verification on multiple datasets, the proposed CMS2I-Mamba shows excellent recognition performance and computational efficiency (parameter count and running speed) in fusion classification tasks. The code for CMS2I-Mamba is available at: https://github.com/ru-willow/CMSI-Mamba.
Visual grounding relies on reasoning between visual and language modalities. Existing multimodal interaction methods struggle to handle complex cross-modal relationships and perform poorly in dynamic scenes. Most paradigms are constrained by single spatial-domain attention, making it difficult to capture global context, long-range dependencies, and balance local and global features. To address these challenges, we propose the Harmonized Spectrum-Gaussian Adaptive Attention Mechanism (HSGAM), a novel mechanism that combines frequency-domain and Gaussian adaptive modulation. HSGAM transforms visual and language features into the frequency domain, overcoming the limitations of spatial-domain self-attention in capturing long-distance dependencies. It also introduces Gaussian adaptive modulation to dynamically adjust feature interactions based on the characteristics of the visual and language modalities. Additionally, we propose the Refinative Discriminative Frequency Network, a feedforward network incorporating enhancement-mitigation and gating mechanisms. Extensive experiments on five benchmark visual grounding tasks illustrate the superiority of our network.
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at https://github.com/Vchitect/TACA.
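TACA's core idea, temperature scaling of cross-modal attention with a timestep-dependent adjustment, can be illustrated with a short PyTorch sketch; the linear temperature schedule and tensor names below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temperature_scaled_cross_attention(q_img, k_txt, v_txt, t, t_max,
                                        tau_min=0.7, tau_max=1.3):
    """Cross-attention from image queries to text keys/values whose softmax
    temperature depends on the diffusion timestep t. A temperature below 1
    sharpens attention onto the (few) text tokens; the linear schedule is an
    illustrative assumption. Shapes: q_img (B, Nq, d), k_txt/v_txt (B, Nk, d)."""
    d = q_img.size(-1)
    tau = tau_min + (tau_max - tau_min) * (t / t_max)       # early steps get a lower temperature
    scores = q_img @ k_txt.transpose(-2, -1) / (d ** 0.5)    # (B, Nq, Nk)
    attn = F.softmax(scores / tau, dim=-1)
    return attn @ v_txt

out = temperature_scaled_cross_attention(
    torch.randn(2, 64, 32), torch.randn(2, 8, 32), torch.randn(2, 8, 32), t=100, t_max=1000)
print(out.shape)  # torch.Size([2, 64, 32])
```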
Heterogeneous change detection is a task of considerable practical importance and significant challenge in remote sensing. It involves identifying change areas using remote sensing images obtained from different sensors or under different imaging conditions. Recently, research has focused on feature-space translation methods based on deep learning for heterogeneous images. However, such methods often lose original image information, and the translated features cannot be efficiently compared, further limiting the accuracy of change detection. To address these issues, we propose a cross-modal feature interaction network (CMFINet). Specifically, CMFINet introduces a cross-modal interaction module (CMIM), which facilitates interaction between heterogeneous features through attention exchange. This approach promotes consistent representation of heterogeneous features while preserving image characteristics. Additionally, we design a differential feature extraction module (DFEM) to enhance the extraction of true change features from the spatial and channel dimensions, facilitating efficient comparison after feature interaction. Extensive experiments conducted on the California, Toulouse, and Wuhan datasets demonstrate that CMFINet outperforms eight existing methods in identifying change areas in different scenes from multimodal images. Compared to the existing methods applied to the three datasets, CMFINet achieved the highest F1 scores of 83.93%, 75.65%, and 95.42%, and the highest mIoU values of 85.38%, 78.34%, and 94.87%, respectively. These results demonstrate the effectiveness and applicability of CMFINet in heterogeneous change detection.
Hyperspectral image (HSI) and light detection and ranging (LiDAR) data joint classification is a challenging task. Existing multisource remote sensing data classification methods often rely on human-designed frameworks for feature extraction, which heavily depend on expert knowledge. To address these limitations, we propose a novel dynamic cross-modal feature interaction network (DCMNet), the first framework leveraging a dynamic routing mechanism for HSI and LiDAR classification. Specifically, our approach introduces three feature interaction blocks: bilinear spatial attention block (BSAB), bilinear channel attention block (BCAB), and integration convolutional block (ICB). These blocks are designed to effectively enhance spatial, spectral, and discriminative feature interactions. A multilayer routing space with routing gates is designed to determine optimal computational paths, enabling data-dependent feature fusion. Additionally, bilinear attention mechanisms are employed to enhance feature interactions in spatial and channel representations. Extensive experiments on three public HSI and LiDAR datasets demonstrate the superiority of DCMNet over the state-of-the-art methods. Our codes are available at https://github.com/oucailab/DCMNet.
Video Question Answering (VideoQA) requires models to comprehend video content and generate answers to natural language questions. VideoQA must reason over both spatial and temporal dimensions, presenting unique challenges as questions require varying degrees of spatial and temporal visual information. This paper proposes a Cross-Modal Spatio-Temporal Interaction Network that adaptively performs spatial and temporal interactions between video and text modalities based on question intent, without requiring additional annotations. Our approach integrates feature representation, intra-modal perception, cross-modal spatio-temporal interaction, and answer generation. The model extracts video and question features, introduces learnable tokens for global semantics and spatio-temporal intent, and employs attention mechanisms to adaptively fuse spatial and temporal information. Experiments on the MSVD-QA and MSRVTT-QA datasets demonstrate that our method achieves competitive performance, achieving 48.4% accuracy on MSVD-QA and outperforming the second-best method by 2.4%. Ablation studies verify the effectiveness of each proposed module, with visualizations confirming the model's ability to adaptively focus on spatial or temporal information based on question intent.
In cross-modal medical image segmentation, the dependence between spatial features and frequency features is easily ignored, and fine-grained frequency features are not fused effectively. To address these problems, this paper proposes a cross-modal segmentation network, DBW-Net. The main innovations are as follows. First, a cross-modal dual-domain bi-directional feature interaction segmentation network, DBW-Net, is designed with three encoders and one decoder; the three encoders extract features from PET/CT, PET, and CT, respectively. Second, a cross-modal feature extractor "from frequency to spatial" (CMFE (F->S)) is designed in the encoder. This module converts the spatial map into multiple spectral maps via the 2D Discrete Cosine Transform (2D DCT), and multi-frequency cross-dimension attention captures the correlations among the spectral-map features across dimensions to generate a refined frequency attention map. The module uses this refined frequency attention map to enhance modal features, fuse cross-modal interactions, and recalibrate the input feature map. Third, a cross-modal feature coupler "from spatial to frequency" (CMFC (S->F)) is designed in the bottleneck layer. This module maps multimodal information to the spatial and frequency domains through a spatial-frequency feature extractor, and cross-domain coupled attention bridges the semantic gap between multimodal fine-grained frequency features and spatial features. Finally, to verify the effectiveness of the proposed method, experiments are carried out on a clinical multimodal lung-tumor medical image dataset and the BraTS2019 brain-tumor public dataset. For lung tumor segmentation, mIoU, Dice, VOE, RVD, and Recall improve by 3.02%, 2.32%, 4.66%, 2.63%, and 4.16%, respectively; for brain tumor segmentation, they improve by 3.06%, 2.31%, 4.68%, 2.64%, and 5.76%, respectively. These results show that the model segments lesions with complex shapes accurately and with relatively low redundancy, significantly improving the segmentation accuracy and robustness of lesion areas and providing technical support for the accurate identification and diagnosis of early lesions.
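The CMFE (F->S) module described above turns a spatial feature map into spectral components with a 2D DCT and derives an attention map from them; the following PyTorch sketch shows that general pattern with a handful of DCT bases and a small MLP, where the chosen frequency pairs and attention head are assumptions rather than the paper's exact design.

```python
import math
import torch
import torch.nn as nn

def dct_filter(u, v, height, width):
    """2D DCT-II basis function for frequency pair (u, v), shape (H, W)."""
    i = torch.arange(height).float()
    j = torch.arange(width).float()
    row = torch.cos(math.pi * (i + 0.5) * u / height)   # (H,)
    col = torch.cos(math.pi * (j + 0.5) * v / width)     # (W,)
    return row[:, None] * col[None, :]

class MultiFrequencyChannelAttention(nn.Module):
    """Illustrative sketch: project a (B, C, H, W) feature map onto a few 2D DCT
    bases, then map the per-channel spectral responses to channel attention
    weights. Frequency pairs and MLP are assumptions, not the exact CMFE design."""
    def __init__(self, channels, height, width, freq_pairs=((0, 0), (0, 1), (1, 0))):
        super().__init__()
        filters = torch.stack([dct_filter(u, v, height, width) for u, v in freq_pairs])
        self.register_buffer("filters", filters)                 # (F, H, W)
        self.mlp = nn.Sequential(
            nn.Linear(channels * len(freq_pairs), channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):
        b, c = x.shape[:2]
        spectra = torch.einsum("bchw,fhw->bcf", x, self.filters)  # per-channel DCT responses
        weights = self.mlp(spectra.reshape(b, -1)).view(b, c, 1, 1)
        return x * weights

attn = MultiFrequencyChannelAttention(channels=16, height=8, width=8)
print(attn(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```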
For the problem of ship re-identification, traditional methods struggle to achieve accurate and interpretable recognition. To address this issue, this paper introduces the prototype learning paradigm—characterized by both interpretability and robustness—into the ship re-identification task, and proposes a Prototype-based Cross-Modal Network. The network acquires modality-invariant features through a multimodal feature extraction module, uses a dynamic prototype management mechanism to construct and update class-specific prototypes for capturing discriminative local features, and simultaneously integrates channel and spatial attention mechanisms to enhance the discriminative power of feature representations. We validated this method on the CMShipReID dataset. Experimental results show that Prototype-based Cross-Modal Network achieves high accuracy, and the model’s decision-making process can be intuitively presented through prototype visualization, effectively solving the "black-box" problem of traditional methods. This study verifies the effectiveness of prototype learning in multimodal ship recognition and provides important references for research on explainable artificial intelligence in fields such as multimodal object retrieval.
Constructing comprehensive multimodal feature representations from RGB images (RGB) and point clouds (PT) in 2D–3D multimodal anomaly detection (MAD) methods is very important for revealing various types of industrial anomalies. For multimodal representations, most existing MAD methods consider only the explicit spatial correspondence between the modality-specific features extracted from RGB and PT through space-aligned fusion, while overlooking the implicit interaction relationships between them. In this study, we propose a uni-modal and cross-modal fusion (UCF) method, which comprehensively incorporates the implicit relationships within and between modalities into multimodal representations. Specifically, UCF first establishes uni-modal and cross-modal embeddings to capture intramodal and intermodal relationships through uni-modal reconstruction and cross-modal mapping. Then, an adaptive nonequal fusion method is proposed to develop fusion embeddings, with the aim of preserving the primary features and reducing interference of the uni-modal and cross-modal embeddings. Finally, the uni-modal, cross-modal, and fusion embeddings all collaborate to reveal anomalies existing in different modalities. Experiments conducted on the MVTec 3D-AD benchmark and a real-world surface mount inspection demonstrate that the proposed UCF outperforms existing approaches, particularly in precise anomaly localization.
This paper presents an innovative approach to anime recommendation systems by integrating multi-modal deep learning with explainable AI techniques. We propose a novel framework that combines visual features, textual content, and user interaction data to create more accurate and interpretable recommendations. Our system addresses key challenges in existing recommendation systems, including the cold-start problem and limited content understanding, through a hybrid architecture that leverages BERT-based natural language processing and convolutional neural networks for visual analysis. Experimental results demonstrate a 27% improvement in recommendation accuracy compared to traditional methods, while providing transparent explanations for recommendations through attention visualization.
Few-shot fine-grained image classification faces significant challenges due to subtle inter-class distinctions and limited annotated samples, where conventional methods often struggle to comprehensively exploit multi-granularity semantic cues under single-scale feature fusion or unimodal representation constraints. To address this, this paper proposes a Multi-Scale Cross-Modal Collaborative Reconstruction Network (MSCMCRN), which synergistically integrates hierarchical feature aggregation, cross-modal interaction, and contrastive-guided optimization. Our framework first introduces a pyramid feature adaptation module that dynamically fuses multi-scale representations through channel-wise and spatial self-attention mechanisms, enabling joint modeling of local discriminative patterns and global structural coherence. A bidirectional cross-modal attention mechanism is then designed to explicitly capture interdependencies between channel-specific attributes and spatial-aware contours, effectively enhancing feature discriminability through mutual reinforcement. Furthermore, this paper proposes a collaborative optimization paradigm that unifies bidirectional feature reconstruction consistency with contrastive metric learning, simultaneously ensuring intra-class compactness and inter-class separability in the embedding space. Extensive evaluations on three challenging fine-grained benchmarks (CUB-200-2011, Stanford Cars, NABirds) demonstrate that MSCMCRN consistently surpasses state-of-the-art approaches in classification tasks. The results underscore the effectiveness of hierarchical multi-modal fusion in mitigating information underutilization and the critical role of contrastive constraints in alleviating few-shot overfitting, providing new insights for open-environment fine-grained recognition scenarios.
Point cloud completion is essential for robotic perception and object reconstruction, and it supports downstream tasks such as grasp planning, obstacle avoidance, and manipulation. However, incomplete geometry caused by self-occlusion and sensor limitations can significantly degrade downstream reasoning and interaction. To address these challenges, we propose HGACNet, a novel framework that reconstructs complete point clouds of individual objects by hierarchically encoding 3D geometric features and fusing them with image-guided priors from a single-view RGB image. At the core of our approach, the Hierarchical Graph Attention (HGA) encoder adaptively selects critical local points through graph attention-based downsampling and progressively refines hierarchical geometric features to better capture structural continuity and spatial relationships. To strengthen cross-modal interaction, we further design a Multi-Scale Cross-Modal Fusion (MSCF) module that performs attention-based feature alignment between hierarchical geometric features and structured visual representations, enabling fine-grained semantic guidance for completion. In addition, we propose a contrastive loss (C-Loss) to explicitly align the feature distributions across modalities, improving completion fidelity under modality discrepancy. Finally, extensive experiments conducted on both the ShapeNet-ViPC benchmark and the YCB-Complete dataset confirm the effectiveness of HGACNet, demonstrating state-of-the-art performance as well as strong applicability in real-world robotic manipulation tasks.
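The C-Loss mentioned above aligns feature distributions across modalities; a common way to do this is a symmetric InfoNCE objective between pooled geometric and visual features, sketched below in PyTorch (a generic alignment loss, not necessarily HGACNet's exact formulation).

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(geo_feat, img_feat, temperature=0.07):
    """Symmetric InfoNCE over a batch: the i-th point-cloud feature should match
    the i-th image feature and repel all other pairings. Generic cross-modal
    alignment sketch; shapes: geo_feat, img_feat (B, d)."""
    geo = F.normalize(geo_feat, dim=-1)
    img = F.normalize(img_feat, dim=-1)
    logits = geo @ img.t() / temperature                  # (B, B) cosine similarities
    targets = torch.arange(geo.size(0), device=geo.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = cross_modal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```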
Mining area scene classification is crucial for deposit evaluation and environmental monitoring. However, existing methods struggle with the homogeneous and heterogeneous spectral, spatial, and topographic features of mining areas, large intra-class variations, and small target sizes. To overcome these limitations, this study integrates RGB and SAR data to construct a multi-modal dataset and proposes an RGB-SAR mining scene classification model with dual feature enhancement and adaptive cross-modal attention interaction. The model includes: (1) a dual feature enhancement module that suppresses irrelevant features and enhances discriminative multi-scale representations of mining targets; (2) a BifocalNet-based feature extraction module using a CNN-Transformer hybrid architecture to capture local textures and model global context; and (3) an attention-based adaptive cross-modal interaction module that achieves deep spectral-geometric feature complementarity through the fusion of RGB and SAR modalities. Experiments show the model achieves an OA of 84.58%, outperforming other models and ranking first or second in most evaluation metrics. The proposed dataset and model thus advance mining scene classification.
To address the severe video quality degradation caused by high-concentration coal dust in confined underground coal mine spaces, which makes behavior detection and discriminative feature learning difficult, this study proposes an improved CRR-YOLO algorithm based on YOLOv11n. To tackle the challenge of learning discriminative features, a cross-modal scene-object matching module, CM-SOM, is designed: by introducing a Vision-Language Model (VLM), it establishes cross-modal interaction between visual and linguistic modalities, enhancing the feature-space distinction between targets and backgrounds and thereby improving the semantic discrimination capability of the detection model in scenarios lacking discriminative features. In the backbone network, a context prior-guided feature extraction network, RepVIT, is embedded; it constructs a dynamic contextual information flow through gated dynamic spatial aggregation, achieving dual guidance of features and weights and strengthening the model's global semantic understanding and contextual dependency modeling of the scene. Furthermore, a feature fusion network with a recalibration mechanism, Re-FPN, is designed: through a selective boundary aggregation module and a lightweight feature enhancement module, it enables complementary enhancement of boundary details and high-level semantic information via a bidirectional interaction mechanism, optimizing multi-scale feature fusion. Experiments on the dedicated underground coal mine behavior dataset DsLMF+ demonstrate that CRR-YOLO achieves 84.3% mAP@0.5 and a 79.1% F1-score, outperforming several advanced models. With only 2.4M parameters and 6.2 GFLOPs, it achieves an inference speed of 253 FPS, striking a favorable balance among accuracy, speed, and complexity, and exhibits strong potential for practical application.
The RGB-D salient object detection technique has garnered significant attention in recent years due to its excellent performance. It outperforms salient object detection methods that rely solely on RGB images by leveraging the geometric morphology and spatial layout information from depth images. However, existing RGB-D detection models still encounter difficulties in accurately recognising and highlighting salient objects when facing complex scenes containing multiple or small objects. In this study, a Cross-modal Interactive and Global Awareness Fusion Network for RGB-D Salient Object Detection, named CIGNet, is proposed. Specifically, convolutional neural networks (CNNs), which are good at extracting local details, and an attention mechanism, which efficiently integrates global information, are utilized to design two fusion methods for RGB and depth images. One of these, the Cross-modal Interaction Fusion Module (CIFM), employs depthwise separable convolution and common-dimensional dynamic convolution to extract rich edge contours and texture details from low-level features. The Global Awareness Fusion Module (GAFM) is designed to relate high-level features between RGB and depth features so as to improve the model's understanding of complex scenes. In addition, prediction maps are generated through a step-by-step decoding process carried out by the Multi-layer Convolutional Fusion Module (MCFM), which gradually yields finer detection results. Finally, comparisons with 12 mainstream methods on six public benchmark datasets demonstrate superior robustness and accuracy.
DAVIS cameras, which output both event streams and frames simultaneously, are increasingly being used to address the primary object detection challenges posed by complex lighting and motion blur. Nevertheless, fully leveraging the abundant temporal information and effectively fusing data from these two modalities remains a formidable challenge. In this paper, we first design a multi-scale spatio-temporal aggregation (MSTA) module to distill richer semantic information from event frames. Secondly, we assimilate and harness the strengths of YOLOv8 and RT-DETR to develop an innovative encoder with Multi-scale Cross-modal dynamic Interactive fusion and multi-level feature interactive Fusion (MCIF). In MCIF, we propose a dynamic channel switching and spatial attention with learnable fusing factors (DCF-CSSA) to improve the complementary interaction of cross-modal features. Extensive experiments demonstrate that our approach (which we call SCNet) significantly outperforms existing state-of-the-art (SOTA) object detection methods that fuse events and frames, achieving mAP50 improvements of 6.2% on PKU-DAVIS-SOD and 12% on DESC-MOD, both of which contain a large number of samples with challenging lighting conditions and motion blur.
Accurate detection of small objects plays an important role in the application of Autonomous aerial vehicles (AAV). However, current works mainly extract comprehensive features from unimodal images, which can obtain very limited distinguishable features for objects, especially those with small sizes. To address this issue, we propose a dynamic cascade cross-modal coassisted network, which integrates multimodal images fusion and fine-grained feature learning to generate powerful object semantic representations. Specifically, we design a multimodal high-order interaction module to achieve collaborative interaction of spatial details and channel dependencies between modalities, thereby enhancing object discrimination. To preserve multimodal fine-grained details, we devise a scale-adaptive dynamic feature prompt module, which dynamically motivates the backbone network to capture feature degradation clues. Meanwhile, to maintain the spatial correlation of multimodal cross-scale features and improve the quality of feature fusion, we derive a global collaborative enhancement module into the feature pyramid network for enhancing the detection accuracy across multiple scales. Extensive experimental results on multimodal datasets have shown that our method achieves favorable performance, surpassing other state-of-the-art methods.
In autonomous driving and robotic navigation, the fusion of multimodal data from LiDAR and cameras relies on accurate extrinsic calibration. However, the calibration accuracy may drop when there is an external disturbance, such as sensor vibrations, temperature fluctuations, and aging. To address this problem, this article presents a novel LiDAR–camera joint calibration network based on cross-modal attention fusion (CMAF) and cross-domain feature extraction (CDFE). The CMAF module is constructed based on region-level matching and pixel-level interaction to improve the cross-modal feature alignment and fusion. To address the semantic inconsistency between encoder and decoder features, the CDFE is designed for a U-shaped architecture with multimodal skip connections to capture large-scale contextual correlations through the transformation from the spatial domain to the frequency domain, and it can maintain semantic consistency through the fusion of global features and original features (residual information) based on the dual-path architecture. Experiments on the KITTI odometry dataset and KITTI-360 dataset show that our network not only significantly outperforms mainstream methods and demonstrates strong generalization capabilities but also achieves high computational efficiency.
Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions, with applications in autonomous navigation and interaction between humans and robots. In this task, objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Given this characteristic, we present SpatiaLoc, a framework utilizing a coarse-to-fine strategy that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, we introduce a Bezier Enhanced Object Spatial Encoder (BEOSE) that models spatial relationships at the instance level using quadratic Bezier curves. Additionally, a Frequency Aware Encoder (FAE) generates spatial representations in the frequency domain at the global level. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions with a loss function aware of uncertainty. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.
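BEOSE encodes instance-level spatial relations with quadratic Bezier curves; the snippet below only illustrates sampling such a curve between two object positions, with the control point placed at an offset midpoint as a stand-in for whatever SpatiaLoc actually learns or derives.

```python
import numpy as np

def quadratic_bezier(p0, p2, ctrl_offset, num=16):
    """Sample a quadratic Bezier curve B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2
    between two object positions p0 and p2. The control point p1 is placed at
    the midpoint plus an offset (fixed here for illustration only).
    Returns an array of shape (num, 2)."""
    p0, p2 = np.asarray(p0, float), np.asarray(p2, float)
    p1 = (p0 + p2) / 2 + np.asarray(ctrl_offset, float)
    t = np.linspace(0.0, 1.0, num)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

curve = quadratic_bezier([0.0, 0.0], [4.0, 0.0], ctrl_offset=[0.0, 1.0])
print(curve.shape, curve[len(curve) // 2])  # (16, 2), apex near [2.0, 0.5]
```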
In the field of remote sensing, efficient fusion of hyperspectral images (HSI) and light detection and ranging (LiDAR) data can capture comprehensive surface features encompassing spectral and elevation information, thereby enhancing classification performance. However, existing methods face challenges such as significant inter-modal feature discrepancies, insufficient utilization of global contextual information, and a lack of multi-dimensional interaction in attention mechanisms. To address these issues, this study proposes a hybrid attention-based fusion network (HAFNet). Specifically, the self-interactive attention module (SIAM) is designed to model long-range dependencies among spectral, spatial, elevation, and cross-modal features, adaptively enhancing global feature representation and overcoming the limitations of traditional methods in global context modelling. The spatial expansion attention module (SEAM) is introduced to optimize the weight distribution of multimodal features and achieve fine-grained control of feature fusion by focusing on key regions. The cross-modal interaction (CMI) block is presented to establish deep feature correlations between the HSI and LiDAR modalities so that complementary information across modalities can be efficiently utilized. Experiments on three publicly available HSI-LiDAR benchmark datasets, including MUUFL Gulfport, Trento, and Houston, demonstrate the effectiveness and superiority of the proposed method.
Artificial emotional intelligence is a sub-domain of human–computer interaction research that aims to develop deep learning models capable of detecting and interpreting human emotional states through various modalities. A major challenge in this domain is identifying meaningful correlations between heterogeneous modalities, for example between audio and visual data, due to their distinct temporal and spatial properties. Traditional fusion techniques used in multimodal learning to combine data from different sources often fail to capture meaningful cross-modal interactions at reasonable computational cost, and they struggle to adapt to varying modality reliability. Following a review of the relevant literature, this study adopts an experimental research method to develop and evaluate a mathematical cross-modal fusion model, thereby addressing a gap in the existing research literature. The framework uses Tucker tensor decomposition to factorize the multi-dimensional data array into a core tensor and a set of factor matrices, supporting the integration of temporal features from audio and spatiotemporal features from visual modalities. A cross-attention mechanism is incorporated to enhance cross-modal interaction, enabling each modality to attend to relevant information from the other. The efficacy of the model is rigorously evaluated on three publicly available datasets, and the results conclusively demonstrate that the proposed fusion technique outperforms conventional fusion methods and several more recent approaches. The findings break new ground in this field and will be of interest to researchers and developers in artificial emotional intelligence.
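The Tucker-decomposition step described above can be sketched with the tensorly library: a (time x audio-feature x visual-feature) interaction tensor is factorized into a core tensor and per-mode factor matrices. The tensor construction and ranks below are illustrative assumptions, not the paper's setup.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Illustrative interaction tensor: outer products of per-frame audio and visual
# features stacked over time, shape (T, d_audio, d_visual).
T, d_a, d_v = 20, 12, 16
audio = np.random.rand(T, d_a)
visual = np.random.rand(T, d_v)
interaction = np.einsum("ta,tv->tav", audio, visual)

# Tucker decomposition: core tensor plus one factor matrix per mode.
core, factors = tucker(tl.tensor(interaction), rank=[5, 4, 6])
print(core.shape, [f.shape for f in factors])  # (5, 4, 6), [(20, 5), (12, 4), (16, 6)]

# Reconstruction error indicates how much cross-modal structure the low-rank core retains.
recon = tl.tucker_to_tensor((core, factors))
print(np.linalg.norm(interaction - recon) / np.linalg.norm(interaction))
```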
Text-Based Person Search (TBPS) aims to retrieve target pedestrian images through language descriptions. However, the visual attributes and textual descriptions of different identities (pedestrians) tend to exhibit considerable similarity, leading to Similar Semantic Interference (SSI). To mitigate this issue, we propose the Adapting Cross-Modal Semantic Discrepancy (ACMSD) method, employing a cross-modal constraint approach to alleviate interference in model training. Specifically, we introduce the Consistent Constraint Alignment (CCA) strategy, which establishes both inter-modal alignment and intra-modal alignment, along with an Identity-Balanced Distribution (IBD) loss. This paradigm utilizes a Cyclic Image-Text Contrastive objective to regularize the spatial distribution of the modalities, while the IBD loss implicitly clusters strong positive samples by using identity as a key index. Additionally, we incorporate an Attention-based Implicit Alignment (AIA) module to enforce modality-specific embeddings, thereby strengthening the interaction between cross-modal information. Extensive experiments are conducted on three public benchmark datasets to evaluate the performance of the ACMSD method.
Recent advancements in 3D Large Language Models (LLMs) have revealed significant potential in enhancing the understanding of 3D scenes. However, previous methods have struggled to extract and utilize fine-grained information about 3D objects due to the coarseness of point clouds, resulting in limitations in understanding objects of interest (OoI) within a scene. To address this issue, we introduce an object-centric 2D-3D interaction module that enhances the ability of LLMs on 3D understanding tasks, consisting of fine-grained 2D representation perception and object-centric 3D scene representation perception. Specifically, the 2D representation associated with 3D objects is captured based on cross-modal semantic consistency without any spatial projector. Experimental results show that our model significantly outperforms existing methods on benchmarks including ScanRefer and ScanQA.
Motivation: Fine-grained cellular characterization provides critical insights into biological processes, including tissue development, disease progression, and treatment responses. The spatial organization of cells and the interactions among distinct cell types play a pivotal role in shaping the tumor micro-environment, driving heterogeneity, and influencing patient prognosis. While computational pathology can uncover morphological structures from tissue images, conventional methods are often restricted to identifying coarse-grained and limited cell types. In contrast, spatial transcriptomics-based approaches hold promise for pinpointing fine-grained transcriptional cell types using histology data. However, these methods tend to overlook key molecular signatures inherent in gene expression data. Results: To this end, we propose a cross-modal unified representation learning framework (CUCA) for identifying fine-grained cell types from histology images. CUCA is trained on paired morphology-molecule spatial transcriptomics data, enabling it to infer fine-grained cell types solely from pathology images. Our model aims to harness the cross-modal embedding alignment paradigm to harmonize the embedding spaces of morphological and molecular modalities, bridging the gap between image patterns and molecular expression signatures. Extensive results across three datasets show that CUCA captures molecule-enhanced cross-modal representations and improves the prediction of fine-grained transcriptional cell abundances. Downstream analyses of cellular spatial architectures and intercellular co-localization reveal that CUCA provides insights into tumor biology, offering potential advancements in cancer research. Availability and implementation: The source code of CUCA is available on Zenodo: 10.5281/zenodo.15087256.
Confidence in the results is a key ingredient to improve the adoption of machine learning methods by clinicians. Uncertainties on the results have been considered in the literature, but mostly those originating from the learning and processing methods. Uncertainty on the data is hardly challenged, as a single sample is often considered representative enough of each subject included in the analysis. In this paper, we propose a representation learning strategy to estimate local uncertainties on a physiological descriptor (here, myocardial deformation) previously obtained from medical images by different definitions or computations. We first use manifold alignment to match the latent representations associated to different high-dimensional input descriptors. Then, we formulate plausible distributions of latent uncertainties, and finally exploit them to reconstruct uncertainties on the input high-dimensional descriptors. We demonstrate its relevance for the quantification of myocardial deformation (strain) from 3D echocardiographic image sequences of the right ventricle, for which a lack of consensus exists in its definition and which directional component to use. We used a database of 100 control subjects with right ventricle overload, for which different types of strain are available at each point of the right ventricle endocardial surface mesh. Our approach quantifies local uncertainties on myocardial deformation from different descriptors defining this physiological concept. Such uncertainties cannot be directly estimated by local statistics on such descriptors, potentially of heterogeneous types. Beyond this controlled illustrative application, our methodology has the potential to be generalized to many other population analyses considering heterogeneous high-dimensional descriptors.
PURPOSE: Alzheimer's disease (AD) is a neurodegenerative disorder characterized by progressive cognitive decline. We proposed a novel latent multimodal deep learning framework to predict AD cognitive status using clinical, neuroimaging, and genetic data. METHODS: Three hundred and twenty-two patients aged between 55 and 92 from the ADNI database were included in the study. Confirmatory Factor Analysis (CFA) was applied to derive the latent scores of AD cognitive impairment as the outcome. A multimodal deep neural network with three modalities, including clinical data, imaging data, and genetic data, was constructed. Attention layers and cross-attention layers were added to improve prediction; modality importance scores were calculated for interpretation. Mean Absolute Error (MAE) and Mean Squared Error (MSE) were used to evaluate model performance. RESULTS: The CFA demonstrated good fit to the data. The multimodal neural network of clinical and imaging modalities with attention layers was the best predictive model, with an MAE of 0.330 and an MSE of 0.206. Clinical data contributed the most (35%) to the prediction of AD cognitive status. CONCLUSION: Our results demonstrated the attention multimodal model's superior performance in predicting the cognitive impairment of AD; introducing attention layers into the model enhanced the prediction performance.
The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
With the integration of image modality, the semantic space of multimodal large language models (MLLMs) is more complex than text-only models, making their interpretability more challenging and their alignment less stable, particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering methods could achieve more than 110% performance with less than 50% data. Our results highlight SAE-V's ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.
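SAE-V builds on the standard sparse-autoencoder recipe applied to hidden activations; the PyTorch sketch below shows that generic building block (an overcomplete ReLU dictionary trained with reconstruction plus L1 sparsity), not SAE-V's cross-modal feature weighting or data-filtering machinery.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE over residual-stream activations: an overcomplete ReLU
    dictionary with an L1 sparsity penalty. This is the generic building block
    that SAE-V extends to multimodal activations, not the full framework."""
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = F.relu(self.encoder(h))          # sparse feature activations
        h_hat = self.decoder(z)
        return h_hat, z

sae = SparseAutoencoder(d_model=512, d_dict=4096)
h = torch.randn(32, 512)                      # e.g. hidden states from an MLLM layer
h_hat, z = sae(h)
loss = F.mse_loss(h_hat, h) + 1e-3 * z.abs().mean()   # reconstruction + sparsity
print(loss.item())
```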
Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models' attention weights highlight differences in acoustic and articulatory timing for certain phonemes.
Multimodal sentiment analysis necessitates the seamless integration of textual and visual signals for the precise interpretation of user-generated material. In this paper, we introduce Dimension-Wise Gated Cross-Attention (DGCA). This new fusion mechanism fine-tunes the interaction between language and images more precisely than prior methods. Our method uses a bidirectional cross-attention module to iteratively enhance text and image features. We use a dimension-wise gating technique in which each latent dimension independently learns to weigh contributions from text or image signals using softmax-normalized modality gates. The approach uses selective per-dimension fusion to highlight important cues from one modality while minimizing less useful characteristics from another. On the SemEval-2020 Memotion dataset, DGCA outperformed the state-of-the-art (SOTA) baselines by 2.27%, highlighting its ability to detect subtle affective cues. In summary, DGCA improves performance and interpretability, enabling fine-grained and context-aware multimodal sentiment analysis.
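The dimension-wise gating described above is simple enough to sketch directly: for each latent dimension, softmax-normalized gates decide how much of the fused value comes from the text versus the image stream. The module below is an illustrative reading of that mechanism, assuming the cross-attended features already share a common dimensionality.

```python
import torch
import torch.nn as nn

class DimensionWiseGate(nn.Module):
    """Per-dimension softmax-normalized modality gates: each of the d latent
    dimensions independently weighs text vs. image contributions. The inputs
    are assumed to be the already cross-attended features; this is only the
    gating core, not the full bidirectional cross-attention module."""
    def __init__(self, d):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(2, d))  # [text; image] logits per dimension

    def forward(self, text_feat, image_feat):
        gates = torch.softmax(self.gate_logits, dim=0)       # (2, d), sums to 1 per dimension
        return gates[0] * text_feat + gates[1] * image_feat

fuse = DimensionWiseGate(d=256)
print(fuse(torch.randn(4, 256), torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```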
Cardiac disease diagnosis demands precise interpretation of complex physiological signals; however, existing systems often rely on unimodal data and lack adaptive fusion strategies. Most conventional frameworks fall short in capturing intermodal dependencies and adjusting to performance variations across heterogeneous inputs. This study introduces StackTrans, a transformer-based multimodal classification framework designed to improve diagnostic accuracy through ECG and PCG signal integration. The architecture comprises modality-specific transformer encoders, a bidirectional cross-modal fusion transformer that facilitates latent-level attention between modalities, and a stacked ensemble mechanism governed by a meta-learner. Residual learning modules enhance prediction refinement, while entropy-guided adaptive voting improves confidence-weighted decision reliability. The PCG and ECG modules are independently trained using the PhysioNet 2016 and MIT-BIH Arrhythmia datasets, respectively, and integrated through joint inference. Evaluations using TensorFlow on an NVIDIA RTX GPU demonstrate that StackTrans attains a precision of 98.6%, an F1 score of 98.4%, and an AUC of 0.99, outperforming unimodal ECG and PCG models by 2.5% and 7.5%, respectively.
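One plausible reading of the entropy-guided adaptive voting mentioned above is sketched below: each branch's class probabilities are weighted by one minus their normalized entropy, so confident branches dominate the fused decision (an assumption about the rule, not the exact StackTrans implementation).

```python
import numpy as np

def entropy_weighted_vote(prob_list, eps=1e-12):
    """Combine per-branch class-probability vectors by weighting each branch
    with (1 - normalized entropy): confident, low-entropy branches dominate.
    prob_list: list of arrays of shape (num_classes,)."""
    probs = np.stack(prob_list)                                   # (M, C)
    ent = -(probs * np.log(probs + eps)).sum(axis=1)              # (M,) branch entropies
    ent_norm = ent / np.log(probs.shape[1])                       # normalized to [0, 1]
    weights = (1.0 - ent_norm) + eps
    weights /= weights.sum()
    fused = (weights[:, None] * probs).sum(axis=0)
    return fused / fused.sum()

ecg = np.array([0.9, 0.1])      # confident ECG branch
pcg = np.array([0.55, 0.45])    # uncertain PCG branch
print(entropy_weighted_vote([ecg, pcg]))  # pulled toward the ECG prediction
```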
We present a mutually aligned diffusion framework for cross-modal biomechanical motion generation, guided by a dynamical systems perspective. By treating each modality, e.g., observed joint angles ($X$) and ground reaction forces ($Y$), as complementary observations of a shared underlying locomotor dynamical system, our method aligns latent representations at each diffusion step, so that one modality can help denoise and disambiguate the other. Our alignment approach is motivated by the fact that local time windows of $X$ and $Y$ represent the same phase of an underlying dynamical system, thereby benefiting from a shared latent manifold. We introduce a simple local latent manifold alignment (LLMA) strategy that incorporates first-order and second-order alignment within the latent space for robust cross-modal biomechanical generation without bells and whistles. Through experiments on multimodal human biomechanics data, we show that aligning local latent dynamics across modalities improves generation fidelity and yields better representations.
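One plausible reading of "first-order and second-order alignment" is matching the means and covariances of paired local-window latents from the two modalities. The toy loss below follows that assumption and is not the paper's exact objective.

```python
# Assumed first-/second-order latent alignment loss for paired local windows of X and Y.
import torch

def llma_alignment_loss(z_x, z_y):
    # z_x, z_y: (batch, d) latents of time-aligned windows of joint angles (X) and forces (Y)
    first_order = ((z_x.mean(dim=0) - z_y.mean(dim=0)) ** 2).sum()        # match latent means
    zx_c = z_x - z_x.mean(dim=0, keepdim=True)
    zy_c = z_y - z_y.mean(dim=0, keepdim=True)
    cov_x = zx_c.T @ zx_c / (z_x.shape[0] - 1)
    cov_y = zy_c.T @ zy_c / (z_y.shape[0] - 1)
    second_order = ((cov_x - cov_y) ** 2).sum()                           # match latent covariances
    return first_order + second_order

loss = llma_alignment_loss(torch.randn(32, 16), torch.randn(32, 16))
```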
Extracellular electrophysiological recordings present unique computational challenges for neuronal classification due to noise, technical variability, and batch effects across experimental systems. We introduce HIPPIE (High-dimensional Interpretation of Physiological Patterns In Extracellular recordings), a deep learning framework that combines self-supervised pretraining on unlabeled datasets with supervised fine-tuning to classify neurons from extracellular recordings. Using conditional convolutional joint autoencoders, HIPPIE learns robust, technology-adjusted representations of waveforms and spiking dynamics. This model can be applied to electrophysiological classification and clustering across diverse biological cultures and technologies. We validated HIPPIE on both in vivo mouse recordings and in vitro brain slices, where it demonstrated superior performance over other unsupervised methods in cell-type discrimination and aligned closely with anatomically defined classes. Its latent space organizes neurons along electrophysiological gradients, while enabling batch and individual corrected alignment of recordings across experiments. HIPPIE establishes a general framework for systematically decoding neuronal diversity in native and engineered systems.
Many radar applications rely primarily on visual classification for their evaluations. However, new research is integrating textual descriptions alongside visual input and showing that such multimodal fusion improves contextual understanding. A critical issue in this area is the effective alignment of coded text with corresponding images. To this end, our paper presents an adversarial training framework that generates descriptive text from the latent space of a visual radar classifier. Our quantitative evaluations show that this dual-task approach maintains a robust classification accuracy of 98.3% despite the inclusion of Gaussian-distributed latent spaces. Beyond these numerical validations, we conduct a qualitative study of the text output in relation to the classifier’s predictions. This analysis highlights the correlation between the generated descriptions and the assigned categories and provides insight into the classifier’s visual interpretation processes, particularly in the context of normally uninterpretable radar data.
Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer -- one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on the meticulously collected large-scale dataset with over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: https://github.com/Masaaki-75/meditok.
Generative AI confronts semiotics with a new kind of sign-producing machine that actively reshapes the production and interpretation of visual content. Addressing the lack of humanities-based transdisciplinary research on this transformation, this study aims to establish a methodological foundation for the semiotic analysis of multimodal AI. By combining visual, social, quantitative, and multimodal semiotics, the paper proposes an integrated micro–meso–macro framework for evaluating AI-generated images. The analysis moves from the micro-level examination of plastic features and text-to-image translation, through the meso-level of enunciation, narrativity, and causality, to the macro-level of social stereotypes, ideology, creativity, rhetoric, truth, and inference. This is supported by a case study on lonely death and a semiotic explanation of latent space.
Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
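The heuristic linear extrapolation that the paper reinterprets is the standard CFG rule applied to the predicted velocity. A minimal sketch, with a stand-in velocity predictor since the real model is not shown here:

```python
# Standard CFG extrapolation on a flow-matching velocity field; `model` is hypothetical.
import torch

def cfg_velocity(model, x_t, t, cond, guidance_scale=5.0):
    v_uncond = model(x_t, t, None)    # unconditional prediction
    v_cond = model(x_t, t, cond)      # conditional prediction
    # Guidance amplifies the "prediction gap" v_cond - v_uncond, which governs sensitivity.
    return v_uncond + guidance_scale * (v_cond - v_uncond)

# Toy check with a stand-in predictor (the paper's manifold-projection step is not shown).
dummy_model = lambda x, t, cond: x * 0.1 + (0.0 if cond is None else 0.05)
v = cfg_velocity(dummy_model, torch.randn(2, 4), t=0.3, cond=1)
```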
Because optical and SAR sensors image at different wavelengths, optical images usually embed low-dimensional manifolds of higher dimension into the ambient space than SAR images do. How to utilize their complementarity remains challenging for multimodal clustering. In this study, we devise a conditional dual diffusion (CDD) model for multimodal clustering of optical and SAR images, and theoretically prove that it is equivalent to a probability flow ordinary differential equation (ODE) having a unique solution. Different from vanilla diffusion models, the CDD model is equipped with a decoupling autoencoder to predict noise and clean images simultaneously, preserving the data manifolds embedded in latent space. To fuse the manifolds of optical and SAR images, we train the model to generate optical images conditioned on SAR images, mapping them into a unified latent space. The learned features extracted from the model are fed to the K-means algorithm to produce the resulting clusters. To the best of our knowledge, this study could be one of the first diffusion models for multimodal clustering. Extensive comparison experiments on three large-scale optical-SAR pair datasets show the superiority of our method over state-of-the-art (SOTA) methods overall in terms of clustering performance and time consumption. The source code is available at https://github.com/suldier/CDD.
Understanding how Large Language Models (LLMs) perform complex reasoning and their failure mechanisms is a challenge in interpretability research. To provide a measurable geometric analysis perspective, we define the concept of the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations corresponding to all correctly reasoned generations. This structure can be conceptualized as the embodiment of the effective thinking paths that the model has learned to successfully solve a given task. Based on this concept, we build REMA, a framework that explains the origins of failures by quantitatively comparing the spatial relationships of internal model representations corresponding to both erroneous and correct reasoning samples. Specifically, REMA first quantifies the geometric deviation of each erroneous representation by calculating its k-nearest neighbors distance to the approximated manifold formed by correct representations, thereby providing a unified failure signal. It then localizes the divergence points where these deviations first become significant by tracking this deviation metric across the model's layers and comparing it against a baseline of internal fluctuations from correct representations, thus identifying where the reasoning chain begins to go off-track. Our extensive experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations. The results also validate the effectiveness of the REMA framework in analyzing the origins of reasoning failures. This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for in-depth understanding and diagnosis of the internal computational processes of black-box models.
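The deviation signal REMA describes (the mean k-nearest-neighbor distance from an erroneous representation to the set of correct-reasoning representations, computed layer by layer) can be sketched as follows; array shapes and k are illustrative assumptions.

```python
# Per-layer kNN deviation of erroneous hidden states from the "reasoning manifold"
# approximated by correct-reasoning hidden states.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_deviation(correct_reps, error_reps, k=10):
    # correct_reps: (n_correct, d), error_reps: (n_error, d), both from one layer
    nn = NearestNeighbors(n_neighbors=k).fit(correct_reps)
    dists, _ = nn.kneighbors(error_reps)          # (n_error, k) distances to nearest correct reps
    return dists.mean(axis=1)                     # one deviation score per erroneous sample

# Toy: track the deviation metric across 12 layers to locate where reasoning goes off-track.
layer_scores = [
    knn_deviation(np.random.randn(500, 64), np.random.randn(50, 64))
    for _ in range(12)
]
```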
Eye-tracking data reveals valuable insights into users’ cognitive states but is difficult to analyze due to its structured, non-linguistic nature. While large language models (LLMs) excel at reasoning over text, they struggle with temporal and numerical data. This paper presents a multimodal human–AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals. The framework includes: (1) a multi-stage pipeline using horizontal and vertical segmentation alongside LLM reasoning to uncover latent gaze patterns; (2) an Expert–Model Co-Scoring Module that integrates expert judgment with LLM output to generate trust scores for behavioral interpretations; and (3) a hybrid anomaly detection module combining LSTM-based temporal modeling with LLM-driven semantic analysis. Our results across several LLMs and prompt strategies show improvements in consistency, interpretability, and performance, with up to 50% accuracy in difficulty prediction tasks. This approach offers a scalable, interpretable solution for cognitive modeling and has broad potential in adaptive learning, human–computer interaction, and educational analytics.
Most models of generative AI for images assume that images are inherently low-dimensional objects embedded within a high-dimensional space. Additionally, it is often implicitly assumed that thematic image datasets form smooth or piecewise smooth manifolds. Common approaches overlook the geometric structure and focus solely on probabilistic methods, approximating the probability distribution through universal approximation techniques such as the kernel method. In some generative models, the low-dimensional nature of the data manifests itself through the introduction of a lower-dimensional latent space. Yet the probability distribution in the latent or the manifold's coordinate space is considered uninteresting and is predefined or assumed uniform. In this study, we address the problem of Blind Image Denoising (BID), and to some extent the problem of generating images from noise, by unifying geometric and probabilistic perspectives. We introduce a novel framework that improves upon existing probabilistic approaches by incorporating geometric assumptions that enable the effective use of kernel-based probabilistic methods. Furthermore, the proposed framework extends prior geometric approaches by combining explicit and implicit manifold descriptions through the introduction of a distance function. The resulting framework demystifies diffusion models by interpreting them as a projection mechanism onto the manifold of "good images". This interpretation leads to the construction of a new deterministic model, the Manifold-Probabilistic Projection Model (MPPM), which operates in both the representation (pixel) space and the latent space. We demonstrate that the Latent MPPM (LMPPM) outperforms the Latent Diffusion Model (LDM) across various datasets, achieving superior results in terms of image restoration and generation.
Understanding low-dimensional structures within high-dimensional data is crucial for visualization, interpretation, and denoising in complex datasets. Despite advances in manifold learning techniques, key challenges such as limited global insight and the lack of interpretable analytical descriptions remain unresolved. In this work, we introduce a novel framework, GAMLA (Global Analytical Manifold Learning using Auto-encoding). GAMLA employs a two-round training process within an auto-encoding framework to derive both character and complementary representations of the underlying manifold. With the character representation, the manifold is represented by a parametric function that unfolds the manifold to provide a global coordinate system. With the complementary representation, an approximate explicit manifold description is developed, offering a global and analytical representation of the smooth manifolds underlying high-dimensional datasets. This enables the analytical derivation of geometric properties such as curvature and normal vectors. Moreover, we find that the two representations together decompose the whole latent space and can thus characterize the local spatial structure surrounding the manifold, proving particularly effective in anomaly detection and categorization. Through extensive experiments on benchmark datasets and real-world applications, GAMLA demonstrates its ability to achieve computational efficiency and interpretability while providing precise geometric and structural insights. This framework bridges the gap between data-driven manifold learning and analytical geometry, presenting a versatile tool for exploring the intrinsic properties of complex datasets.
Latent space representations learned through variational autoencoders (VAEs) offer a powerful, unsupervised means of capturing nonlinear structure in high-dimensional oncology data. The latent embedding spaces often encode information that differs from traditional bioinformatics methods such as t-SNE or UMAP. However, a persistent challenge remains: how to meaningfully visualize and interpret these latent variables. Common dimensionality reduction techniques like UMAP and t-SNE, while effective, can obscure graph-theoretic relationships that may underlie important biological patterns. We present a novel approach for intuitive latent space interpretation using NetFlow, a method that visualizes the organizational structure of samples as a graph derived from their latent embeddings. NetFlow constructs a topological representation based on the metric structure of the latent space, drawing on concepts from network analysis, optimal mass transport, topological data analysis, and lineage tracing. The result is an interpretable graph in which nodes represent individual subjects and edges reflect local and global similarity among the samples. We applied this method to multiple myeloma (MM), a hematologic malignancy marked by malignant plasma cell proliferation and inevitable relapse. To uncover hidden disease subtypes, we trained a VAE on multimodal data from 659 patients in the MMRF CoMMpass dataset (IA19), integrating transcriptomic, genomic, and clinical features. Direct clustering of latent space vectors failed to yield subgroups with significant differences in progression-free survival (PFS). In contrast, NetFlow generated a latent space graph that, when clustered using Louvain community detection, identified three distinct subtypes: one high-risk and two low-risk groups. The high-risk group exhibited a median PFS 1.5 years shorter than that of the low-risk groups (p<0.001) and was enriched for known poor prognostic markers including gain 1q21 (59%), MAF translocations (17%), and t(4;14) (66%). Although the two low-risk groups had similar PFS outcomes, they differed in their molecular profiles, suggesting they may benefit from different therapeutic strategies. These preliminary results demonstrate that variational autoencoders and NetFlow graph analysis can reveal latent substructures missed by traditional clustering, thereby advancing latent space explainability and enabling improved subtype discovery in MM. Our framework offers a generalizable pipeline for interpreting deep generative models in cancer genomics.
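A heavily simplified stand-in for the graph-construction and clustering step (a k-nearest-neighbor graph over VAE latents plus Louvain communities) is sketched below; it omits the optimal-transport, TDA, and lineage-tracing machinery the abstract mentions, and all sizes are placeholders.

```python
# kNN graph over VAE latent embeddings + Louvain communities (requires networkx >= 3.0).
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities
from sklearn.neighbors import kneighbors_graph

def latent_graph_clusters(z, k=15, seed=0):
    # z: (n_patients, latent_dim) embeddings from a trained VAE
    adj = kneighbors_graph(z, n_neighbors=k, mode="connectivity")   # sparse kNN adjacency
    g = nx.from_scipy_sparse_array(adj)                             # undirected patient graph
    communities = louvain_communities(g, seed=seed)
    labels = np.empty(z.shape[0], dtype=int)
    for c_id, members in enumerate(communities):
        labels[list(members)] = c_id
    return g, labels

graph, labels = latent_graph_clusters(np.random.randn(659, 32))     # toy latents, 659 patients
```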
The advancement of remote sensing technology has led to a progressive enhancement in the resolution of remote sensing data, offering a multiperspective approach to Earth observation and facilitating a more comprehensive scene interpretation. As the two most commonly utilized data sources in remote sensing, optical images and synthetic aperture radar (SAR) data can provide complementary information, effectively compensating for the limitations inherent to a single modality. However, existing methods for using these two data sources face the following issues: first, insufficient utilization of the complete information provided by the source data; second, inadequate consideration of the distinct characteristics of different modalities during feature extraction; and third, ignoring the misalignment between heterogeneous data, leading to large information loss. To tackle these challenges, we first construct a benchmark dataset comprising complex-valued SAR data and optical images, named Multi-Complex-Seg. In order to fully mine the complete and valid information provided by both data sources, we construct a multimodal segmentation framework built on the principle of "subdomain extraction and cross-domain fusion," in which we design a feature extractor better suited to complex-valued SAR data that fully considers its unique geometric properties. In addition, a dynamic feature alignment module (DFAM) is proposed to further adjust the cross-modal features, and a cross-modal heterogeneous feature fusion module (CHFFM) first maps features into the same latent space to obtain better fused features. Together, DFAM and CHFFM reduce the large semantic gap between modalities, thus facilitating the extraction of intramodal specificity and cross-modal complementarity. Extensive experiments on the proposed Multi-Complex-Seg confirm the effectiveness of our framework in comparison to other state-of-the-art multimodal segmentation approaches.
Multimodal human–AI systems generally consider facial expressions and body motions as separate input streams, leading to disjointed interpretations and diminished emotional coherence. To overcome this issue, we offer the Engagement-Safe Expressive Alignment (ESEA) paradigm and the Unified Visual Synchrony (UVS) framework as its computational implementation. UVS models the coherence between facial expressions and gestures, offering an interpretable visual synchrony signal that can function as adaptive feedback in human–AI interactions. The framework’s key component is the Consistency Index for Affective Synchrony (CIAS), which correlates brief visual segments with scalar synchrony scores through a common latent representation. Facial and gestural signals are processed by modality-specific projection networks into a unified latent space, and CIAS is derived from the similarity and short-term temporal consistency of these latent trajectories. The synchrony index is regarded as an estimation of affective visual coherence within the ESEA paradigm. We formalize the UVS/CIAS framework and conduct a comparative experimental evaluation utilizing matched and mismatched face–gesture segments derived from rendered dialog footage. Utilizing ROC analysis, score distribution comparisons, temporal visualizations, and negative control tests, we illustrate that CIAS effectively captures structured face–gesture alignment that surpasses similarity-based baselines, while also delivering a persistent, time-resolved synchronization signal. These findings establish CIAS as a principled and interpretable feedback signal for future affect-aware, engagement-focused multimodal agents.
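One way a CIAS-like score could be instantiated is per-frame cosine similarity between the projected face and gesture latents plus a short-term temporal-consistency term; the weighting and window handling below are assumptions, not the paper's exact definition.

```python
# Assumed CIAS-style synchrony score over projected face/gesture latent trajectories.
import torch
import torch.nn.functional as F

def cias_score(z_face, z_gesture, alpha=0.5):
    # z_face, z_gesture: (T, d) latent trajectories for one short video segment
    sim = F.cosine_similarity(z_face, z_gesture, dim=-1)           # (T,) per-frame alignment
    consistency = 1.0 - (sim[1:] - sim[:-1]).abs().mean()          # penalize jittery alignment
    return alpha * sim.mean() + (1 - alpha) * consistency          # scalar synchrony index

score = cias_score(torch.randn(48, 128), torch.randn(48, 128))
```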
Predicting spatial gene expression from histological images is a fundamental task in understanding tissue organization and molecular phenotypes. However, existing methods often rely on single-modality representations or lack effective alignment between image and transcriptomic features. To address these limitations, we propose MViTGene, a unified multimodal learning framework that integrates histological imaging and spatial transcriptomics through a shared latent representation space. Specifically, histological H&E images are encoded by a ResNet50-based convolutional stem and a MobileViT Transformer backbone to extract hierarchical visual representations. Both modalities are projected into a shared latent space via linear-GELU-dropout transformation blocks, enabling cross-modal alignment through a contrastive learning objective that maximizes agreement between corresponding image and spot embeddings. Experimental results on the 10x Genomics Visium dataset of human liver tissue demonstrate that MViTGene achieves significantly higher prediction accuracy than existing methods across multiple gene subsets, with improvements of 20%, 33%, and 12% in predicting marker genes, highly expressed genes, and highly variable genes, respectively. The significant improvement in relevance indicates that the model can more accurately capture the true correspondence between tissue morphology and gene expression, therefore enabling more reliable biological interpretation. It provides a computational tool for high-throughput spatial gene expression prediction that balances performance and interpretability.
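The contrastive alignment objective described here is essentially a symmetric CLIP-style loss between image and spot embeddings in the shared latent space. A minimal sketch, with temperature and dimensions as assumptions:

```python
# Symmetric contrastive (InfoNCE) alignment between image and spot embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, spot_emb, temperature=0.07):
    # img_emb, spot_emb: (batch, d); row i of each corresponds to the same tissue spot
    img = F.normalize(img_emb, dim=-1)
    spot = F.normalize(spot_emb, dim=-1)
    logits = img @ spot.T / temperature                      # pairwise similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```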
To address the problems of insufficient medical image feature extraction and the competing demands of classification accuracy and computational complexity in automatic skin lesion diagnosis in edge computing environments, this paper proposes a real-time, pseudo-multimodal, low-delay diagnosis framework, SCGViT, based on a vision transformer. The framework is constructed around three functional objectives: mitigating data imbalance through generative modeling, capturing diverse representations via multi-dimensional perception, and optimizing feature fusion through adaptive refinement. First, Class-Conditioned Generative Adversarial Networks (CGANs) are used to simulate the manifolds of minority-class samples in latent space, achieving a preliminary balance of the data distribution. Second, a branch feature-extraction path is constructed that simulates inversion (INV) and infrared (IR) modes from the original RGB images, in order to achieve multi-dimensional perception. Finally, a cross-attention mechanism is used for cross-branch feature aggregation, and a channel-attention mechanism (squeeze-and-excitation) is embedded for secondary refinement of the mixed global and local features, enhancing the representation of key pathological regions by integrating complementary structural and contrast information. Experimental results on the HAM10000 dataset show that the F1 score reached 0.973, the inference speed reached 304.439 FPS, the parameter count was only 0.524 M, and the computational complexity was only 0.866 G FLOPs, achieving a balance between high accuracy and light weight.
Pattern electroretinogram (PERG) is the standard for assessing retinal ganglion cell function. However, the low amplitude and complex waveform of PERG signals complicate clinical interpretation. This study proposes a robust, multimodal hybrid machine learning framework that detects retinal dysfunction under a rigorous patient-level validation strategy by integrating PERG waveform features with clinical demographic data. The PERG-IOBA dataset, consisting of 1354 signals from 304 participants, was used. Training and test sets were separated at the patient level using 5-fold cross-validation to approximate real clinical deployment and to avoid information leakage. A dual-stream model was developed: one stream processed functional PERG features (latency, amplitude, and RMS) via a multilayer perceptron, while the second stream processed clinical data. The two representations were then fused at the feature concatenation level. This model (Model 1) was compared with a stacking ensemble of conventional classifiers (Model 2) and a two-stage cascade classifier tailored for screening (Model 3). Model 2 achieved the most balanced and robust performance, with 71.4% accuracy and an area under the curve of 0.76 in 5-fold patient-level cross-validation. Although more modest than many previously reported values, these metrics are consistent with realistic clinical generalizability. Model 3 provided the highest sensitivity, 79.7%, for screening purposes. SHAP analysis confirmed P50-N95 amplitude as the primary biomarker but identified age as a significant confounding factor, mimicking expert clinical judgment. This study demonstrates that retinal dysfunction detection requires a holistic approach that integrates signal morphology and patient demographics.
This paper presents a retrieval-augmented, CLIP-based pipeline to support educators and psychologists in interpreting children’s drawings. The proposed system integrates vision-language captioning, case-based retrieval, and large language model generation, where CLIP produces descriptive captions, FAISS retrieves the top-$k$ ($k=3$) semantically similar expert-annotated cases, and GPT-3.5 synthesizes a psychologically informed narrative report. Evaluation on 1,222 drawings with expert reference analyses considers six controlled variants (A1–A6) across two axes: (i) semantic similarity to expert interpretations (cosine similarity, BERTScore, ROUGE-L F1) and (ii) coverage and robustness (content-unit precision/recall/F1, ROUGE-L recall, new-information rate, and negation mismatch). In addition to the CLIP/BLIP ablations (A1–A4), two image-only multimodal VLM baselines are benchmarked (GPT-4o and LLaVA; A5–A6), and a small blinded expert rating study is conducted on 20 randomly selected drawings to contextualize trade-offs. Results show that retrieval-grounded configurations improve coverage compared to non-retrieval baselines and, in the CLIP-based comparison (A3 vs. A4), reduce polarity (negation) mismatches while maintaining competitive semantic similarity. Although GPT-4o attains the strongest similarity and content-unit scores overall, it incurs substantially higher latency and does not provide retrieval-based evidence trails; LLaVA shows mixed gains with notably lower content recall. The proposed CLIP-based variant demonstrates a favorable trade-off between faithfulness and efficiency for practical deployment, and is explicitly positioned as decision support rather than diagnosis, surfacing salient cues and transparent evidence trails for expert oversight. Overall, this work demonstrates that retrieval grounding can make AI-assisted interpretation of children’s drawings more consistent, transparent, and scalable in educational and clinical contexts.
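The FAISS retrieval step (top-3 expert-annotated cases for a new drawing's embedding) can be sketched as follows; the embedding dimension, cosine-similarity setup, and variable names are assumptions for illustration, not the paper's released code.

```python
# Case retrieval sketch: index expert-annotated case embeddings and fetch the top-3 neighbors.
import numpy as np
import faiss

d = 512                                              # assumed CLIP embedding size
case_embeddings = np.random.rand(1222, d).astype("float32")   # one embedding per annotated case
faiss.normalize_L2(case_embeddings)                  # cosine similarity via inner product
index = faiss.IndexFlatIP(d)
index.add(case_embeddings)

query = np.random.rand(1, d).astype("float32")       # embedding of a new drawing
faiss.normalize_L2(query)
scores, case_ids = index.search(query, 3)            # top-3 most similar expert cases
```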
Dyslexia is a neurocognitive disorder in which problems with reading fluency, spelling, and comprehension persist throughout the lifespan and often go undiagnosed until severe academic problems emerge. Traditional diagnostic systems rely heavily on behavioural assessment, which is prone to examiner bias, language barriers, and cultural diversity, limiting the cost-effectiveness and speed of diagnosis. To overcome these limitations, this paper develops a cross-modal attention-based fusion system: a multimodal screening model that integrates brain-derived neural indices (NIs) and behavioural measures (BMs). Neurophysiological recordings (electroencephalographic and functional near-infrared spectroscopy responses during reading and phonological processing) capture cortical activation, while behavioural indicators (e.g., eye-movement patterns, handwriting movement patterns, and common literacy tests) provide additional cues to cognitive and motor activity. Each data stream is preprocessed, discriminative features are extracted, and the features are transformed into latent embeddings. To facilitate robust fusion and lessen the burden of incomplete inputs or heterogeneous signal quality, an attention-based selective alignment module selectively targets modality-specific information. A predictive classifier then uses this joint representation to produce calibrated probabilistic screening outputs and risk evaluations. An interpretability layer identifies influential attributes, emphasizing the transparency required for critical clinical assessment. On a heterogeneous cohort, the method is more accurate and sensitive than behaviour-only baselines and performs consistently across demographic and linguistic subgroups. The proposed design offers a translational link between neural correlates and dyslexia screening, aiding the interpretation of functional brain measures and thereby supporting earlier, more equitable, and more reliable dyslexia screening and evidence-based intervention planning.
Single-cell omics technologies now enable simultaneous profiling of multiple genomic modalities within individual cells. Integration of these multi-omics data necessitates computational frameworks that establish cross-modal associations while preserving biological fidelity. A central challenge lies in balancing two competing objectives: alignment of heterogeneous omics layers and retention of modality-specific distributions. Excessive alignment may lead to semantic loss, while strict distribution preservation may trigger modal separation in the latent representations. This fundamental trade-off underscores the need for advanced strategies to harmonize integrative accuracy with biological authenticity. This paper proposes a multi-stage deep feature fusion method with multi-kernel omics manifold preservation and a guided gating optimization strategy, and designs the scMAG algorithm. scMAG is a framework aimed at improving the clustering accuracy and data visualization of single-cell multi-omics while achieving adaptive alignment of multi-omics latent spaces and optimizing the distribution of omics data in the ambient space, effectively suppressing biological noise and measurement errors. To comprehensively evaluate the scMAG algorithm, we used paired scRNA-seq with scATAC-seq and scRNA-seq with ADT datasets as benchmarks. We consistently observed that scMAG outperformed other algorithms in terms of clustering quality and data visualization clarity. Further multi-task experimental analysis indicates that this improvement stems from scMAG's ability to improve the distribution of latent features in the data space and to adaptively balance shared signals and modality-specific signals across multi-omics. scMAG not only significantly improved clustering performance but also performed well in feature dimensionality reduction, batch effect removal, multimodal data integration, and cell trajectory inference, and showed good biological interpretability. This method provides new theoretical support and a practical reference for in-depth analysis of multimodal single-cell data.
The final grouping divides research on the spatial interpretability of multimodal data into eight dimensions. The core research paths show an evolution from low-level "geometric manifolds and latent-space alignment," through mid-level "spatial pattern extraction in vertical domains (medicine, remote sensing, perception)," to high-level "evaluation of spatial reasoning in large models and generation control." The research focus has shifted from simple multimodal feature fusion toward mechanistic interpretation of models' internal spatial processing (e.g., SAE applications), and toward maintaining spatio-temporal logical consistency and robustness in physical-interaction scenarios such as embodied intelligence and autonomous driving.