Spatial Interpretability of Multimodal Data
Spatial Grounding Mechanisms and Referring Alignment in Multimodal Large Models
This line of work examines how multimodal large language models (MLLMs) achieve fine-grained visual grounding and logical understanding of 2D/3D spatial features, for example by improving tokenization, introducing coordinate regression, or decoupling perception from reasoning.
- LLaVA-ST: A Multimodal Large Language Model for Fine-Grained Spatial-Temporal Understanding(Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, Si Liu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- START: Spatial and Textual Learning for Chart Understanding(Zhuoming Liu, Xiaofeng Gao, Feiyang Niu, Qiaozi Gao, Liu Liu, Robinson Piramuthu, 2025, ArXiv)
- Decoupling the Image Perception and Multimodal Reasoning for Reasoning Segmentation with Digital Twin Representations(Yizhe Li, Dell Zhang, Xuelong Li, Yiqing Shen, 2025, ArXiv)
- Query-Guided Spatial Localization with Multimodal Large Language Models(Zhihan Zhang, Tianle Hu, Dong Yin, 2025, Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing)
- Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks(Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, Ryo Hachiuma, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Grounding Everything in Tokens for Multimodal Large Language Models(Xiangxuan Ren, Zhongdao Wang, Liping Hou, Pin Tang, Guoqing Wang, Chao Ma, 2025, ArXiv)
- HMR3D: Hierarchical Multimodal Representation for 3D Scene Understanding with Large Vision-Language Model(Chen Li, Eric Peh, Basura Fernando, 2025, ArXiv)
- SOLIDGEO: Measuring Multimodal Spatial Math Reasoning in Solid Geometry(Peijie Wang, Chao Yang, Zhongzhi Li, Fei Yin, Dekang Ran, Mi Tian, Zhi-Long Ji, Jinfeng Bai, Chenglin Liu, 2025, ArXiv)
- LLM Meets Scene Graph: Can Large Language Models Understand and Generate Scene Graphs? A Benchmark and Empirical Study(Dongil Yang, Minjin Kim, Sunghwan Kim, Beong-woo Kwak, Minjun Park, Jinseok Hong, Woontack Woo, Jinyoung Yeo, 2025, ArXiv)
- Where, Not What: Compelling Video LLMs to Learn Geometric Causality for 3D-Grounding(Yutong Zhong, 2025, ArXiv)
- Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage(Junfei Xie, Peng Pan, Xulong Zhang, 2026, ArXiv)
- Injecting Cross-modal Fine-Grained Perception into LLMs for 3D Object-of-Interest Understanding(Qianqian Sun, Lu Shi, Linna Zhang, Gaoyun An, Yi Jin, Yidong Li, Yigang Cen, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- Enhancing Spatial Reasoning in Multimodal Vision-Language Models via Depth-Aware Feature Integration(Hiroo Tsuji, 2025, 2025 IEEE/ACIS 29th International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD))
Theoretical Characterization of Latent-Space Geometric Manifolds and Cross-Modal Consistency
Focuses on the topological structure of multimodal data in latent space, using manifold learning, geometric calibration, and contrastive learning to construct shared, interpretable embedding spaces, with the goal of mathematically explaining how the semantic gap between modalities is bridged.
- Calibrated Similarity for Reliable Geometric Analysis of Embedding Spaces(Nicolas Tacheny, 2026, ArXiv)
- REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model(Bo Li, Guanzhi Deng, Ronghao Chen, Junrong Yue, Shuo Zhang, Qing Xia Zhao, Linqi Song, Lijie Wen, 2025, ArXiv)
- Event2Vec: A Geometric Approach to Learning Composable Representations of Event Sequences(Antonin Sulc, 2025, ArXiv)
- Analytical Discovery of Manifold with Machine Learning(Yafei Shen, Huan-Fei Ma, Ling Yang, 2025, ArXiv)
- A Geometric Unification of Generative AI with Manifold-Probabilistic Projection Models(Leah Bar, L. Yosef, Shai Zucker, N. Shoham, Inbar Seroussi, N. Sochen, 2025, ArXiv)
- scMAG: Integrating single-cell multi-omics data via multi-stage deep fusion with manifold-aware gating.(Shuangquan Li, Junhao Zou, 2026, Computational biology and chemistry)
- SimE: A Knowledge Graph Embedding Model to Encode Self-Similar Structures Through Algebraic and Geometric Transformations(K. Amouzouvi, Yasharajsinh Chudasama, Disha Purohit, Ariam Rivas, Bowen Song, Jens Lehmann, Sahar Vahdati, Maria-Esther Vidal, 2025, IEEE Access)
- Integrating Large Language Models and Möbius Group Transformations for Temporal Knowledge Graph Embedding on the Riemann Sphere(Sensen Zhang, Xun Liang, Simin Niu, Zhendong Niu, Bo Wu, Gengxin Hua, Longzheng Wang, Zhenyu Guan, Hanyu Wang, Xuan Zhang, Zhiyu Li, Yuefeng Ma, 2025, No journal)
- JE-IRT: A Geometric Lens on LLM Abilities through Joint Embedding Item Response Theory(Louie Hong Yao, Nicholas Jarvis, Tiffany Zhan, Saptarshi Ghosh, Linfeng Liu, Tianyu Jiang, 2025, ArXiv)
- High-dimensional multimodal uncertainty estimation by manifold alignment:Application to 3D right ventricular strain computations(Maxime Di Folco, Gabriel Bernardino, Patrick Clarysse, Nicolas Duchateau, 2025, ArXiv)
- Cross-Modal Retrieval via Contrastive Representation Learning of Images and Text Descriptions(Zhaoxuan Li, Nan Tang, 2025, Int. J. Pattern Recognit. Artif. Intell.)
- Multi-Semantic Embedding Hashing for LargeScale Cross-Modal Retrieval(Zhiying Cui, Hongbin Ma, Yingli Wang, 2025, 2025 4th International Joint Conference on Information and Communication Engineering (JCICE))
- Intramodal consistency in triplet-based cross-modal learning for image retrieval(Mario Mallea, Ricardo Ñanculef, Mauricio Araya, 2025, Machine Learning)
- Bridging the Semantic Chasm: Synergistic Conceptual Anchoring for Generalized Few-Shot and Zero-Shot OOD Perception(Alexandros Christoforos, S. Jenkins, Michael Brown, Tuan Pham, David L. Chen, 2026, ArXiv)
- Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations(Yilun Kuang, Yash Dagade, Tim G. J. Rudner, Randall Balestriero, Yann LeCun, 2026, ArXiv)
- Collaboratively Semantic Alignment and Metric Learning for Cross-Modal Hashing(Jiaxing Li, W. Wong, Lin Jiang, Kaihang Jiang, Xiaozhao Fang, Shengli Xie, Jie Wen, 2025, IEEE Transactions on Knowledge and Data Engineering)
- Calibrated Multimodal Representation Learning with Missing Modalities(Xiaohao Liu, Xiaobo Xia, Jiaheng Wei, Shuo Yang, Xiu Su, See-Kiong Ng, Tat-Seng Chua, 2025, ArXiv)
- CMLFA: cross-modal latent feature aligning for text-to-image person re-identification(Xiaofa Yang, Jianming Wang, Yukuan Sun, Xiaojie Duan, 2025, Journal of Electronic Imaging)
- Interpreting the Linear Structure of Vision-language Model Embedding Spaces(Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Naomi Saphra, S. Kakade, Stephanie Gil, 2025, ArXiv)
- Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency(Yanbiao Ma, Wei Dai, Bo Liu, Jiayi Chen, Wenke Huang, Guancheng Wan, Zhiwu Lu, Junchi Yan, 2025, ArXiv)
3D Scene Perception and Navigation for Embodied AI and Autonomous Driving
Emphasizes modeling of physical space in dynamic environments, using 3D point clouds, Gaussian Splatting, and bird's-eye-view (BEV) representations to achieve cross-sensor spatio-temporal synchronization, obstacle avoidance, and trajectory prediction.
- GARNET: Gaussian Feature Rendering Network for 3D Object Classification(Lingfan Zheng, Yifan Liu, Zhen Xiao, Jianbin Jiao, Yanzhao Zhou, 2025, 2025 4th International Conference on Image Processing, Computer Vision and Machine Learning (ICICML))
- CoRe3D: Collaborative Reasoning as a Foundation for 3D Intelligence(Tianjiao Yu, Xinzhuo Li, Yifan Shen, Yuanzhe Liu, Ismini Lourentzou, 2025, ArXiv)
- AffordBot: 3D Fine-grained Embodied Reasoning via Multimodal Large Language Models(Xinyi Wang, Xun Yang, Yanlong Xu, Yuchen Wu, Zhen Li, Na Zhao, 2025, ArXiv)
- SpaceMind: Camera-Guided Modality Fusion for Spatial Reasoning in Vision-Language Models(Ruosen Zhao, Zhikang Zhang, Jialei Xu, Jiahao Chang, Dong Chen, Lingyun Li, Weijian Sun, Zizhuang Wei, 2025, ArXiv)
- HSTI: A Light Hierarchical Spatial-Temporal Interaction Model for Map-Free Trajectory Prediction(Xiaoyang Luo, Shuaiqi Fu, Bolin Gao, Yanan Zhao, Huachun Tan, Ze Song, 2025, IEEE Transactions on Intelligent Transportation Systems)
- SpatiaLoc: Leveraging Multi-Level Spatial Enhanced Descriptors for Cross-Modal Localization(Tianyi Shang, Pengjie Xu, Zhaojun Deng, Zhenyu Li, Zhicong Chen, Lijun Wu, 2026, ArXiv)
- Spatiotemporal Graph Networks for Relational Reasoning in Campus Infrastructure Management(Sanjay Agal, Krishna M Raulji, Nikunj Bhavsar, Pooja Bhatt, 2025, International Journal of Advanced Computer Science and Applications)
- MDNet: Multimodal Cooperative Perception via Spatial Alignment of Modal Decision-Making(Junyang He, Xiaoheng Deng, Jinsong Gui, Tao Zhang, Xiangjian He, 2025, IEEE Internet of Things Journal)
- DQTP: A Robot Autonomous Task Planner in Open Environments Based on Qwen2-VL model(Yuanjin Qu, Xiangtao Hu, 2025, Proceedings of the 2025 2nd International Conference on Industrial Automation and Robotics)
- BEVTraj: Map-Free End-to-End Trajectory Prediction in Bird's-Eye View with Deformable Attention and Sparse Goal Proposals(Minsang Kong, Myeongjun Kim, Sang Gu Kang, Sang Hun Lee, 2025, ArXiv)
- Multimodal sensor fusion with cross-modal alignment and attention mechanism for enhanced object detection in autonomous driving systems(Piaopiao Qin, Qien Gao, Feng Jiang, Hongjian Zhang, Yi Huang, 2025, No journal)
- BeLLA: End-to-End Birds Eye View Large Language Assistant for Autonomous Driving(Karthik Mohan, Sonam Singh, Amit Arvind Kale, 2025, ArXiv)
- Lang3D-XL: Language Embedded 3D Gaussians for Large-scale Scenes(Shai Krakovsky, Gal Fiebelman, Sagie Benaim, Hadar Averbuch-Elor, 2025, Proceedings of the SIGGRAPH Asia 2025 Conference Papers)
- PolarGFusion3D: Polar Graph Fusion Network for Enhanced Multimodal 3D Perception in Intelligent Vehicles(Lu Li, Chao Wei, 2025, IEEE Transactions on Intelligent Vehicles)
- Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System(Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li, 2025, ArXiv)
- Vehicle-to-Infrastructure Collaborative Spatial Perception via Multimodal Large Language Models(Kimia Ehsani, Walid Saad, 2025, ArXiv)
- Integrated Multimodal Perception and Predictive Motion Forecasting via Cross-Modal Adaptive Attention(Bakhita Salman, Alexander Chávez, Muneeb Yassin, 2026, Future Transportation)
- DCI-PRNet: 3D Object Detection Network via Dual Cross-modal Interaction and Progressive Reasoning(Sixian Chan, Beibei Duan, Xinggang Fan, Jie Hu, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- HGACNet: Hierarchical Graph Attention Network for Cross-Modal Point Cloud Completion(Yadan Zeng, Jiadong Zhou, Xiaohan Li, I-Ming Chen, 2025, ArXiv)
Spatial-Topological Association Analysis of Medical Images and Physiological Signals
Studies how to preserve anatomical consistency, using cross-scale alignment (e.g., histology with spatial transcriptomics) and spatio-temporal feature fusion to improve the interpretability of clinical tasks such as lesion detection and brain-function analysis.
- VAMF-Net: multimodal fusion and multiscale attention for 3D brain tumor segmentation(Tiansong Sheng, Beibei Hou, 2026, No journal)
- SCDM: Unified Representation Learning for EEG-to-fNIRS Cross-Modal Generation in MI-BCIs(Yisheng Li, Yishan Wang, Baiying Lei, Shuqiang Wang, 2025, IEEE Transactions on Medical Imaging)
- Multimodal deep learning with anatomically constrained attention for screening MRI-detectable TMJ abnormalities from panoramic images(Hyo-Jung Jung, Dayun Ju, Chanyoung Kim, Seong Jae Hwang, Chena Lee, Younjung Park, 2026, NPJ Digital Medicine)
- Hybrid CNN-Graph Attention Networks for Diabetic Retinopathy Grading: A Multimodal Feature Fusion Approach(Vamshi Krishna Pandugula, Abhishek Choudhary, Ravi Uyyala, Padmavathi Vurubindi, 2025, 2025 3rd International Conference on Inventive Computing and Informatics (ICICI))
- Cross-modal dual-domain bi-direction feature interaction network for medical imaging semantic segmentation(Tao Zhou, Qitao Liu, Ke Song, Wenwen Chai, Kaixiong Chen, Huiling Lu, 2025, Scientific Reports)
- DSMFF-UNet: A dual-stream U-Net network based on multimodal feature fusion for EEG depression recognition(Yitong Li, Lu Yuan, 2025, 2025 International Conference on Signal Processing, Computer Networks and Communications (SPCNC))
- Fusion Analysis of EEG-fNIRS Multimodal Brain Signals: A Multitask Classification Algorithm Incorporating Spatial-Temporal Convolution and Dual Attention Mechanisms(Xingbin Shi, Haiyan Wang, Baojiang Li, Yuxin Qin, Cheng Peng, Yifan Lu, 2025, IEEE Transactions on Instrumentation and Measurement)
- StackTrans–Multimodal Heart Disease Detection Using Stacked Transformer Fusion Framework(Muhammad Adnan, Yang Yi, Enci Wang, Md Nasir Imtiaz, 2025, IEEE Access)
- CS2former: Multimodal feature fusion transformer with dual channel-spatial feature extraction module for bipolar disorder diagnosis(Guoxin Wang, Fengmei Fan, Shipeng Dai, Shan An, Chao Zhang, Sheng Shi, Yunan Mei, Feng Yu, Qi Wang, Xiaole Han, Shuping Tan, Yunlong Tan, Zhiren Wang, 2025, Computerized medical imaging and graphics : the official journal of the Computerized Medical Imaging Society)
- MDMF-Net: Multi-Dimensional Integrated Multimodal Feature Fusion Alzheimer's Disease Prediction Network*(Jiahao Mei, Yuhang Peng, Zicheng Zhang, Huabin Wang, 2025, 2025 10th International Conference on Signal and Image Processing (ICSIP))
- CRISP-SAM2: SAM2 with Cross-Modal Interaction and Semantic Prompting for Multi-Organ Segmentation(Xinlei Yu, Changmiao Wang, Hui Jin, Ahmed Elazab, Gangyong Jia, Xiang Wan, Changqing Zou, Ruiquan Ge, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- RTGMFF: Enhanced fMRI-Based Brain Disorder Diagnosis via ROI-Driven Text Generation and Multimodal Feature Fusion(Junhao Jia, Yifei Sun, Yunyou Liu, Cheng Yang, Changmiao Wang, Feiwei Qin, Yong Peng, Wenwen Min, 2025, 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM))
- Predicting fine-grained cell types from histology images through cross-modal learning in spatial transcriptomics(Chaoyang Yan, Zhihan Ruan, Song Chen, Yichen Pan, Xue Han, Yuanyu Li, Jian Liu, 2025, Bioinformatics)
- Integrating histology and spatial transcriptomics via multimodal transformers and contrastive representation learning for accurate gene expression prediction.(Kai Wang, Li Shi, Xue Li, Wei Li, Bin Wang, Shihua Zhou, Ben Cao, Pan Zheng, 2026, Journal of biomedical informatics)
- MedITok: A Unified Tokenizer for Medical Image Synthesis and Interpretation(Chenglong Ma, Yuanfeng Ji, Jin Ye, Zilong Li, Chenhui Wang, Junzhi Ning (Raymond) Ning, Wei Li, Lihao Liu, Qiushan Guo, Tian-Xin Li, Junjun He, Hongming Shan, 2025, ArXiv)
- MedXAI-MM: A Unified Multi-Modality Explainable Artificial Intelligence Framework for Clinical Medical Imaging(Olfa Ghribi, M. Kharrat, Mohamed Chaabane, 2025, 2025 IEEE 22nd International Conference on Sciences and Techniques of Automatic Control and Computer Engineering (STA))
- Detection of Retinal Dysfunction with Multimodal PERG Analysis: A Patient-Level Hybrid Machine Learning Framework(Yavuz Bahadır Koca, 2026, Engineering Perspective)
- Multimodal meta-learning for lung nodule classification under few-shot settings: a trustworthy AI framework with 2D–3D cross-modal alignment(Juhi Gupta, Monica Mehrotra, Arpita Aggarwal, 2026, Pattern Analysis and Applications)
Heterogeneous Feature Fusion and Change Detection for Remote Sensing and Geospatial Data
Targets satellite, SAR, and optical remote sensing data, using attention mechanisms and geometric constraints to overcome mismatched spatial resolutions and viewing-angle bias, enabling accurate land-cover classification and geographic information interpretation.
- KOMPSAT-3/3A Image-text Dataset for Training Large Multimodal Models(Han Oh, Donghyun Shin, Daewon Chung, 2025, GEO DATA)
- Hyperspectral Unmixing Based on Dual-Graph Manifold Regularization: Joint Preservation of Spatial-Spectral Geometric Structure(Xiaojuan Luo, Kewen Qu, 2025, 2025 6th International Conference on Geology, Mapping and Remote Sensing (ICGMRS))
- Dual Feature Enhancement and Adaptive Attention Fusion for Cross-Modal Scene Classification of Mining Land(Yue Zhou, Jiangyuan Wang, Xianju Li, 2025, 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- A Multimodal Semantic Segmentation Framework for Heterogeneous Optical and Complex SAR Data(Sining Xiao, Peijin Wang, Wenhui Diao, Kun Fu, Xian Sun, 2025, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
- Cross-modal feature interaction network for heterogeneous change detection(Zhiwei Yang, Xiaoqin Wang, Haihan Lin, Mengmeng Li, Mengjing Lin, 2025, Geo-spatial Information Science)
- Multimodal Feature-Enhanced Unet for Forward-Looking Sonar Segmentation(Zefan Wu, Wei Li, Xiaoguang Chen, Lin Mei, Ye-Qiong Wang, 2025, 2025 IEEE 102nd Vehicular Technology Conference (VTC2025-Fall))
- DF2RQ: Dynamic Feature Fusion via Region-Wise Queries for Semantic Segmentation of Multimodal Remote Sensing Data(Shiyang Feng, Zhaowei Li, Bo Zhang, Bin Wang, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- A Vision Centric Remote Sensing Benchmark(Abduljaleel Adejumo, Faegheh Yeganli, Clifford Broni-Bediako, Aoran Xiao, Naoto Yokoya, Mennatullah Siam, 2025, ArXiv)
- Cross-Modal Contrastive Pansharpening via Uncertainty Guidance(Haoying Zeng, Xiaoyuan Yang, Kangqing Shen, Yixiao Li, Jin Jiang, Fangyi Li, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- Robust Multimodal Road Extraction via Dual-Layer Evidential Fusion Networks for Remote Sensing(Hui Wang, You-Sun Huang, Yu Wang, Donglai Jiao, Hao Huang, Yun Lin, Guan Gui, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- A Mamba-Aware Spatial–Spectral Cross-Modal Network for Remote Sensing Classification(Mengru Ma, Jiaxuan Zhao, Wenping Ma, Licheng Jiao, Lingling Li, Xu Liu, Fang Liu, Shuyuan Yang, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- SCIAU-Net: A Spatial-Spectral Cross-Modal Interaction ADMM Unfolding Network for Hyperspectral and Multispectral Image Fusion(Ruiqing Zhang, Bingbing Lei, Wei Feng, X. Chai, 2026, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing)
- Joint Classification of Hyperspectral and LiDAR Data Using Hierarchical Multimodal Feature Aggregation-Based Multihead Axial Attention Transformer(Fei Zhu, Cuiping Shi, Kaijie Shi, Liguo Wang, 2025, IEEE Transactions on Geoscience and Remote Sensing)
- Spatial Uncertainty Quantification in Wildfire Forecasting for Climate-Resilient Emergency Planning(A. Chakravarty, 2025, ArXiv)
- AUTOMATIC SEMANTIC SEGMENTATION OF SENTINEL-2 IMAGES: INTEGRATION OF CLUSTERING AND LARGE MULTIMODAL MODELS FOR CLUSTER INTERPRETATION(O. Honcharov, V. Hnatushenko, 2025, International scientific and technical conference Information technologies in metallurgy and machine building)
Spatial Layout Control and Spatio-Temporal Consistency in Generative Models
Studies how diffusion models and related architectures use structured instructions, spatial sketches, or geometric guidance to keep generated images and videos consistent with physical common sense in layout and motion.
- UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions(Guozhen Zhang, Zixiang Zhou, Teng Hu, Ziqiao Peng, Youliang Zhang, Yi Chen, Yuan Zhou, Qinglin Lu, Limin Wang, 2025, ArXiv)
- Genesis: Multimodal Driving Scene Generation with Spatio-Temporal and Cross-Modal Consistency(Xiangyu Guo, Zhanqian Wu, Kaixin Xiong, Ziyang Xu, Lijun Zhou, Gangwei Xu, Shao-hua Xu, Haiyang Sun, Bing Wang, Guang Chen, Hangjun Ye, Wenyu Liu, Xinggang Wang, 2025, ArXiv)
- Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization(Jialu Li, Shoubin Yu, Han Lin, Jaemin Cho, Jaehong Yoon, Mohit Bansal, 2025, ArXiv)
- Enhancing Text-to-SVG Generation via Structured Instruction Embedding and Syntax-Aware Reinforcement in Large Language Models(Peiqing Lu, Shihao Zhao, Yushang Zhao, Runmian Chang, Yinuo Yang, 2025, 2025 8th International Conference on Computer Information Science and Application Technology (CISAT))
- Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers(Zhengyao Lv, Tianlin Pan, Chenyang Si, Zhaoxi Chen, Wangmeng Zuo, Ziwei Liu, Kwan-Yee K. Wong, 2025, ArXiv)
- Seg2Any: Open-set Segmentation-Mask-to-Image Generation with Precise Shape and Semantic Control(Danfeng li, Hui Zhang, Shenghong Wang, Jiachen Li, Zuxuan Wu, 2025, ArXiv)
- In-Context Brush: Zero-shot Customized Subject Insertion with Context-Aware Latent Space Manipulation(Yu Xu, Fan Tang, You Wu, Lin Gao, Oliver Deussen, Hongbin Yan, Jintao Li, Juan Cao, Tong-Yee Lee, 2025, Proceedings of the SIGGRAPH Asia 2025 Conference Papers)
- Bridging the Discrete-Continuous Gap: Unified Multimodal Generation via Coupled Manifold Discrete Absorbing Diffusion(Yuanfeng Xu, Yuhao Chen, Liang Lin, Guangrun Wang, 2026, ArXiv)
- Improving Classifier-Free Guidance of Flow Matching via Manifold Projection(Jian-feng Cai, Haixia Liu, Zhe Su, Chao Wang, 2026, ArXiv)
Spatial Pattern Mining for Behavior Analysis, Industrial Monitoring, and Task-Specific Applications
Covers human action recognition, micro-expression analysis, industrial defect detection, and similar tasks, using spatio-temporal graph convolutions (ST-GCN) or multi-scale interaction to capture anomalous features and semantic associations in dynamic environments.
- Real-Time Fall Detection via Spatio-Temporal Collaborative Attention and Multimodal Feature Fusion Based on Deep Learning(Yin-zu Chen, 2025, 2025 6th International Conference on Computer Vision, Image and Deep Learning (CVIDL))
- MAPLE: Modality-Agnostic Prototype Learning for Egocentric Action Recognition(Da Li, Di Zhou, Yishan Zou, Shenghua Li, Meng Liu, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- A Multimodal Feature Fusion Based Action Normality Matching Algorithm for Online Sports Education(Jie Liu, Fengyang Fu, 2025, 2025 International Conference on Educational Technology Management (ICETM))
- LLaFEA: Frame-Event Complementary Fusion for Fine-Grained Spatiotemporal Understanding in LMMs(Hanyu Zhou, Gim Hee Lee, 2025, ArXiv)
- Unified Visual Synchrony: A Framework for Face–Gesture Coherence in Multimodal Human–AI Interaction(Saule Kudubayeva, Yernar Seksenbayev, A. Yerimbetova, E. Daiyrbayeva, B. Sakenov, Duman Telman, M. Turdalyuly, 2026, Big Data and Cognitive Computing)
- Real-Time Personnel Behavior Detection in Dusty Coal Mines via Dehazing-Enhanced YOLO with Cross-Modal Guidance(Meng Zhou, C. Qin, 2025, Academic Journal of Computing & Information Science)
- Correlation-Driven Multi-Level Multimodal Learning for Anomaly Detection on Smart Electric Grid(Yuhan Dong, 2025, 2025 2nd International Conference on Smart Grid and Artificial Intelligence (SGAI))
- MFCNet: Multimodal Feature Fusion Network for RGB-T Vehicle Density Estimation(Ling-Xiao Qin, Hong-mei Sun, Xiao-Meng Duan, Cheng-Yue Che, Ruisheng Jia, 2025, IEEE Internet of Things Journal)
- Multimodal Industrial Anomaly Detection via Uni-Modal and Cross-Modal Fusion(Hao Cheng, Jiaxiang Luo, Xianyong Zhang, 2025, IEEE Transactions on Industrial Informatics)
- Abnormal behavior detection method based on multimodal feature fusion with attention mechanism(Yuexia Liu, Yunfei Cheng, Wu Wang, 2025, No journal)
- A Multimodal Gait Recognition Method Based on Skeleton Maps and Channel-Prior Convolutional Attention(Dongliang Yang, Changjiang Song, Siwen Sun, 2025, Proceedings of the 2025 International Conference on Computer Technology, Digital Media and Communication)
- SwinET-IoT: A Mask-Guided Multimodal Transformer Framework for Real-Time Emotion Prediction in Intelligent Learning Environments(D. P, G.Thailambal, 2026, 2026 International Conference on Electronics and Renewable Systems (ICEARS))
Benchmarks for Spatial Interpretability and Diagnosis of Model Internals
Develops benchmarks dedicated to 3D/6D spatial reasoning, hallucination mitigation, and cross-modal alignment, and uses tools such as sparse autoencoders (SAEs) to probe the physical meaning of internal model features (a minimal SAE sketch follows the paper list below).
- Spatial457: A Diagnostic Benchmark for 6D Spatial Reasoning of Large Multimodal Models(Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, Alan Yuille, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Open3DVQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space(Weichen Zhan, Zile Zhou, Zhiheng Zheng, Chen Gao, Jinqiang Cui, Yong Li, Xinlei Chen, Xiao-Ping Zhang, 2025, ArXiv)
- Mitigating Object Hallucinations in Large Vision-Language Models via Attention Calibration(Younan Zhu, Linwei Tao, Minjing Dong, Chang Xu, 2025, ArXiv)
- SAE-V: Interpreting Multimodal Models for Enhanced Alignment(Hantao Lou, Changye Li, Jiaming Ji, Yaodong Yang, 2025, ArXiv)
- Explaining multimodal LLMs via intra-modal token interactions(Jiawei Liang, Ruoyu Chen, Xianghao Jiao, Siyuan Liang, Shiming Liu, Qunli Zhang, Zheng Hu, Xiaochun Cao, 2025, ArXiv)
- HEDGE: Hallucination Estimation via Dense Geometric Entropy for VQA with Vision-Language Models(Sushant Gautam, Michael A. Riegler, Paal Halvorsen, 2025, ArXiv)
- FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation(Jingwei Zuo, Lingzhou Mu, Fan Jiang, Chengcheng Ma, Mu Xu, Yonggang Qi, 2026, ArXiv)
- Multimodal Chain of Continuous Thought for Latent-Space Reasoning in Vision-Language Models(Tan-Hanh Pham, Chris Ngo, 2025, ArXiv)
- Are Multimodal Large Language Models Ready for Omnidirectional Spatial Reasoning?(Zihao Dongfang, Xu Zheng, Ziqiao Weng, Yuanhuiyi Lyu, D. Paudel, L. V. Gool, Kailun Yang, Xuming Hu, 2025, ArXiv)
- MARBLE: A Hard Benchmark for Multimodal Spatial Reasoning and Planning(Yulun Jiang, Yekun Chai, Maria Brbić, Michael Moor, 2025, ArXiv)
- AVQACL: A Novel Benchmark for Audio-Visual Question Answering Continual Learning(Kaixuan Wu, Xinde Li, Xinling Li, Chuanfei Hu, Guoliang Wu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning(Siqu Ou, Hongcheng Liu, Pingjie Wang, Yusheng Liao, Chuan Xuan, Yanfeng Wang, Yu Wang, 2025, No journal)
- Visual Representation Alignment for Multimodal Large Language Models(Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, Chanho Eom, Sung‐Jin Hong, Seungryong Kim, 2025, ArXiv)
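To make the SAE-based probing mentioned in this group's description concrete, below is a minimal sketch (assuming PyTorch and illustrative dimensions) of a sparse autoencoder that could be fit to activations collected from a multimodal model; it is not any particular paper's implementation.

```python
# A minimal sparse-autoencoder sketch: a single-hidden-layer autoencoder with an L1
# sparsity penalty is fit to model activations, and the resulting sparse features can
# then be inspected for interpretable (e.g., spatial) meaning. Dimensions and the
# sparsity weight are illustrative assumptions.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 4096):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # sparse feature codes
        return self.decoder(features), features


def sae_loss(recon, target, features, l1_weight: float = 1e-3):
    # Reconstruction error plus an L1 penalty that pushes most feature codes to zero.
    return ((recon - target) ** 2).mean() + l1_weight * features.abs().mean()


if __name__ == "__main__":
    sae = SparseAutoencoder()
    acts = torch.randn(64, 768)          # stand-in for collected model activations
    recon, feats = sae(acts)
    loss = sae_loss(recon, acts, feats)
    loss.backward()
    print(loss.item(), (feats > 0).float().mean().item())  # loss and feature density
```

The L1 term keeps most feature activations at zero, so each surviving feature can be examined individually, for instance by checking which image regions or spatial relations most strongly activate it.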
The final grouping divides research on the spatial interpretability of multimodal data into eight dimensions. The core research paths trace an evolution from low-level "geometric manifolds and latent-space alignment", through mid-level "spatial pattern extraction in vertical domains (medicine, remote sensing, perception)", to high-level "spatial reasoning evaluation and generation control for large models". The research focus has shifted from simple multimodal feature fusion toward mechanistic interpretation of how models process space internally (e.g., SAE-based analysis), and toward maintaining spatio-temporal consistency and robustness in physically interactive settings such as embodied AI and autonomous driving.
A total of 206 related papers were surveyed. Abstracts of selected papers follow.
Multimodal large language models have achieved remarkable progress in tasks such as visual understanding, captioning, and reasoning, demonstrating their strong ability to bridge visual and textual modalities. However, spatial localization remains a highly challenging task. Existing approaches typically rely on directly predicting spatial coordinates from large models; however, these numerical outputs lack semantic interpretability and provide little information about how the model connects language to specific regions in the visual input. Moreover, when extending from static images to dynamic videos, the number of predicted spatial coordinates grows rapidly across frames, making temporal alignment with video content difficult. In addition, the large volume of coordinate outputs leads to inefficiency in inference, which significantly limits the applicability of current methods to long or high-resolution videos. To solve the mentioned issues, we design a query-guided spatial localization baseline based on large multimodal models. The key idea is to move away from treating localization as direct coordinate regression and instead leverage semantically meaningful queries to guide the localization process. Specifically, we design spatial-aware queries that capture frame-level spatial cues, and we introduce a query-guided decoder that maps hidden representations of large multimodal models into spatial coordinates. This design not only enables more interpretable localization but also facilitates temporal alignment in videos by associating queries with corresponding frames. Furthermore, it reduces the computational burden by avoiding dense coordinate prediction for every frame. Extensive experiments on both Referring Expression Comprehension and video spatial localization benchmarks demonstrate that our method achieves superior performance compared to state-of-the-art baselines.
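To illustrate the query-guided idea described in this abstract, here is a minimal PyTorch sketch in which learnable spatial-aware queries cross-attend to the hidden states of a multimodal LLM and are decoded into normalized boxes, instead of having the model emit coordinate tokens directly. Module names, dimensions, and the number of queries are illustrative assumptions, not the paper's implementation.

```python
# Query-guided box decoding: learnable queries attend over LMM hidden states and a
# small head maps each attended query to a normalized bounding box.
import torch
import torch.nn as nn


class QueryGuidedBoxDecoder(nn.Module):
    def __init__(self, hidden_dim: int = 1024, num_queries: int = 8, num_heads: int = 8):
        super().__init__()
        # One learnable query per frame (or per referred object) captures spatial cues.
        self.spatial_queries = nn.Parameter(torch.randn(num_queries, hidden_dim))
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Map each attended query to a normalized box (cx, cy, w, h) in [0, 1].
        self.box_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, 4)
        )

    def forward(self, llm_hidden: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (batch, seq_len, hidden_dim) hidden states from the multimodal LLM.
        batch = llm_hidden.size(0)
        queries = self.spatial_queries.unsqueeze(0).expand(batch, -1, -1)
        attended, _ = self.cross_attn(queries, llm_hidden, llm_hidden)
        return self.box_head(attended).sigmoid()  # (batch, num_queries, 4)


if __name__ == "__main__":
    decoder = QueryGuidedBoxDecoder()
    fake_hidden = torch.randn(2, 256, 1024)   # stand-in for LMM hidden states
    print(decoder(fake_hidden).shape)         # torch.Size([2, 8, 4])
```

Because each query is tied to a frame or object rather than to a stream of coordinate tokens, the number of decoded outputs stays fixed per frame, which is what keeps inference cost manageable for video.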
Vision Language Models (VLMs) excel at identifying and describing objects but often fail at spatial reasoning. We study why VLMs, such as LLaVA, underutilize spatial cues despite having positional encodings and spatially rich vision encoder features. Our analysis reveals a key imbalance: vision token embeddings have much larger norms than text tokens, suppressing LLM's position embedding. To expose this mechanism, we developed three interpretability tools: (1) the Position Sensitivity Index, which quantifies reliance on token order, (2) the Cross Modality Balance, which reveals attention head allocation patterns, and (3) a RoPE Sensitivity probe, which measures dependence on rotary positional embeddings. These tools uncover that vision tokens and system prompts dominate attention. We validated our mechanistic understanding through targeted interventions that predictably restore positional sensitivity. These findings reveal previously unknown failure modes in multimodal attention and demonstrate how interpretability analysis can guide principled improvements.
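A small diagnostic sketch in the spirit of the analysis above: it compares the mean L2 norm of vision-token embeddings against text-token embeddings, and probes position sensitivity by shuffling token order and measuring how much a pooled representation changes. The tensors, sizes, and the mean-pooling probe are illustrative assumptions rather than the paper's exact Position Sensitivity Index.

```python
# Two lightweight probes: a modality norm ratio and a token-order sensitivity score.
import torch


def modality_norm_ratio(vision_tokens: torch.Tensor, text_tokens: torch.Tensor) -> float:
    # Each input: (num_tokens, hidden_dim). A ratio much larger than 1 suggests that
    # vision tokens may dominate attention and drown out positional information.
    return (vision_tokens.norm(dim=-1).mean() / text_tokens.norm(dim=-1).mean()).item()


def position_sensitivity(encode, tokens: torch.Tensor) -> float:
    # 'encode' is any callable mapping (seq_len, hidden_dim) -> pooled vector.
    # Compare the output on the original vs. a randomly permuted sequence;
    # values near zero indicate the encoder ignores token order.
    baseline = encode(tokens)
    shuffled = encode(tokens[torch.randperm(tokens.size(0))])
    return ((baseline - shuffled).norm() / (baseline.norm() + 1e-8)).item()


if __name__ == "__main__":
    vision = torch.randn(576, 1024) * 5.0    # vision tokens with an inflated scale
    text = torch.randn(32, 1024)
    print("norm ratio:", modality_norm_ratio(vision, text))
    pool = lambda x: x.mean(dim=0)           # order-invariant pooling, sensitivity ~0
    print("position sensitivity:", position_sensitivity(pool, torch.randn(64, 1024)))
```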
This study aims to improve the accuracy and interpretability of large multimodal models (LMMs) specialized in satellite image analysis by constructing an image-text dataset based on KOMPSAT-3/3A imagery and presenting the results of training using this dataset. Conventional LMMs are primarily trained on general images, limiting their ability to effectively interpret the specific characteristics of satellite imagery, such as spectral bands, spatial resolution, and viewing angles. To address this limitation, we developed an image-text dataset, divided into pretraining and finetuning stages, based on the existing KOMPSAT object detection dataset. The pretraining dataset consists of captions summarizing the overall theme and key information of each image. The fine-tuning dataset integrates metadata -including acquisition time, sensor type, and coordinates-with detailed object detection labels to generate six types of question-answer pairs: detailed descriptions, conversations with varying answer lengths, bounding box identification, multiple choice questions, and complex reasoning. This structured dataset enables the model to learn not only the general context of satellite images but also fine-grained details such as object quantity, location, and geographic attributes. Training with the new KOMPSAT-based dataset significantly improved the model’s accuracy in recognizing regional information and object characteristics in satellite imagery. Finetuned models achieved substantially higher accuracy than previous models, surpassing even the GPT-4o model and demonstrating the effectiveness of a domain-specific dataset. The findings of this study are expected to contribute to various remote sensing applications, including automated satellite image analysis, change detection, and object detection.
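As a rough illustration of the fine-tuning data construction described above, the sketch below turns object-detection labels and scene metadata into a few question-answer pair types (counting, bounding-box identification, metadata-grounded description). Field names and templates are hypothetical and do not reflect the released dataset schema.

```python
# Build simple QA pairs from detection labels and acquisition metadata.
from collections import Counter


def build_qa_pairs(metadata: dict, objects: list) -> list:
    counts = Counter(obj["category"] for obj in objects)
    qa = []
    # Counting questions.
    for category, n in counts.items():
        qa.append({"question": f"How many {category}s are visible in this image?",
                   "answer": str(n)})
    # Bounding-box identification questions.
    for obj in objects:
        qa.append({"question": f"Where is the {obj['category']} located?",
                   "answer": f"Bounding box: {obj['bbox']}"})
    # Metadata-grounded description.
    qa.append({"question": "Describe when and how this image was acquired.",
               "answer": f"Acquired by {metadata['sensor']} on {metadata['time']} "
                         f"near {metadata['coordinates']}."})
    return qa


if __name__ == "__main__":
    meta = {"sensor": "KOMPSAT-3A", "time": "2021-05-14", "coordinates": "(37.5N, 127.0E)"}
    objs = [{"category": "ship", "bbox": [120, 44, 180, 96]},
            {"category": "ship", "bbox": [300, 210, 352, 260]}]
    for pair in build_qa_pairs(meta, objs):
        print(pair)
```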
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood. Existing interpretability research has primarily focused on cross-modal attribution, identifying which image regions the model attends to during output generation. However, these approaches often overlook intra-modal dependencies. In the visual modality, attributing importance to isolated image patches ignores spatial context due to limited receptive fields, resulting in fragmented and noisy explanations. In the textual modality, reliance on preceding tokens introduces spurious activations. Failing to effectively mitigate these interference compromises attribution fidelity. To address these limitations, we propose enhancing interpretability by leveraging intra-modal interaction. For the visual branch, we introduce \textit{Multi-Scale Explanation Aggregation} (MSEA), which aggregates attributions over multi-scale inputs to dynamically adjust receptive fields, producing more holistic and spatially coherent visual explanations. For the textual branch, we propose \textit{Activation Ranking Correlation} (ARC), which measures the relevance of contextual tokens to the current token via alignment of their top-$k$ prediction rankings. ARC leverages this relevance to suppress spurious activations from irrelevant contexts while preserving semantically coherent ones. Extensive experiments across state-of-the-art MLLMs and benchmark datasets demonstrate that our approach consistently outperforms existing interpretability methods, yielding more faithful and fine-grained explanations of model behavior.
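The following sketch illustrates the multi-scale aggregation idea behind MSEA: the same attribution method is run on several rescaled copies of the image, and the resulting maps are resized back and averaged into a smoother explanation. The attribution callable and scales here are placeholders; the paper's actual MSEA and ARC procedures are more involved.

```python
# Aggregate saliency maps computed at multiple input scales into one explanation.
import torch
import torch.nn.functional as F


def multiscale_attribution(image, attribute, scales=(0.75, 1.0, 1.25)):
    # image: (3, H, W); attribute: callable (3, h, w) -> (h, w) saliency map.
    _, h, w = image.shape
    maps = []
    for s in scales:
        scaled = F.interpolate(image.unsqueeze(0), scale_factor=s,
                               mode="bilinear", align_corners=False).squeeze(0)
        m = attribute(scaled).unsqueeze(0).unsqueeze(0)
        maps.append(F.interpolate(m, size=(h, w), mode="bilinear",
                                  align_corners=False).squeeze())
    return torch.stack(maps).mean(dim=0)  # aggregated (H, W) explanation


if __name__ == "__main__":
    fake_saliency = lambda img: img.abs().mean(dim=0)  # toy stand-in attribution
    out = multiscale_attribution(torch.randn(3, 224, 224), fake_saliency)
    print(out.shape)  # torch.Size([224, 224])
```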
The Visual Question Answering (VQA) task requires not only accurate answers but also interpretable reasoning processes, particularly in real-world applications where transparency is critical. To reduce annotation and computational costs while maintaining interpretability, the Few-shot Multimodal Explainable VQA (FS-MEVQA) task has been introduced, which aims to generate explanations with limited supervision. In this work, we propose OPeMer (One-shot Prompting and Execution-driven Multimodal Explainable Reasoning), a code-based framework that leverages large language models (LLMs) to generate executable Python programs for multimodal reasoning in a oneshot setting. These programs interact with a lightweight Python API to process visual inputs, capture intermediate reasoning artifacts—such as object crops and spatial relations—and optionally call external tools for open-world visual understanding. The resulting execution traces are serialized and provided to the LLM via a secondary prompt, enabling the generation of coherent multimodal explanations grounded in both visual and textual evidence. Designed without reliance on handcrafted rules or large-scale supervision, OPeMer offers an efficient and extensible approach to explainable multimodal reasoning. Experimental results on the SME dataset demonstrate that OPeMer achieves strong answer accuracy and explanation quality, even when using cost-effective LLMs under limited supervision, suggesting its potential for scalable and interpretable VQA.
Semantic segmentation of satellite imagery, particularly Sentinel-2 data, is crucial for environmental monitoring and land cover mapping. This paper presents an unsupervised method for land cover classification that eliminates the need for pixel-level annotations. The approach combines clustering techniques (K-Means, DBSCAN, autoencoders) with automated cluster labeling using large vision-language models (e.g., GPT-4, Claude, Gemini 2.0). Clusters are visualized and interpreted by these models based on spatial context and color. The methodology achieves segmentation accuracy of 85–90%, comparable to supervised methods, while ensuring interpretability and scalability. A majority voting mechanism and terminology normalization improve consistency across model outputs. Validation is performed using ESA WorldCover maps. The proposed approach is promising for rapid land cover mapping in resource-constrained or emergency situations.
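A minimal sketch of the unsupervised pipeline outlined above: pixels are clustered with K-Means and each cluster is labeled by majority vote over answers from several vision-language models, with simple terminology normalization. The VLM callables are hypothetical stand-ins for calls to GPT-4 / Claude / Gemini with a cluster visualization; no real API is shown here.

```python
# K-Means pixel clustering plus majority-vote cluster labeling with term normalization.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def cluster_pixels(image: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    # image: (H, W, bands) reflectance array -> per-pixel cluster ids (H, W).
    h, w, bands = image.shape
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
        image.reshape(-1, bands)
    )
    return labels.reshape(h, w)


def normalize_term(term: str) -> str:
    synonyms = {"woodland": "forest", "trees": "forest", "crops": "cropland"}
    term = term.strip().lower()
    return synonyms.get(term, term)


def label_cluster_by_vote(cluster_id: int, ask_vlm_fns) -> str:
    # Query each model, normalize terminology, and take the majority label.
    answers = [normalize_term(ask(cluster_id)) for ask in ask_vlm_fns]
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    fake_scene = np.random.rand(64, 64, 10)        # stand-in for a Sentinel-2 tile
    cluster_map = cluster_pixels(fake_scene)
    fake_models = [lambda _: "Woodland", lambda _: "forest", lambda _: "cropland"]
    print(label_cluster_by_vote(0, fake_models))   # -> "forest"
```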
Traditional anchor graph clustering (AGC) methods usually perform suboptimally when dealing with subspace similarity caused by spectral mixing and typically lack physical interpretability during anchor selection, relying on complex postprocessing steps. To address this issue, we introduce hyperspectral (HS) unmixing (HU) into the AGC framework, showing their inherent equivalence. Specifically, we proposed the subpixel AGC (SAGC) method, which explicitly models subpixel information from mixed spectra to endmembers for clustering. It allows seamless migration of HU methods to AGC tasks for determining the number of anchors and guiding anchor selection. To enhance anchor diversity while preserving subpixel information, we design the maximum anchor diversity selection strategy (MADSS). Finally, we apply this framework to HS–light detection and ranging (LiDAR) AGC, providing an implicit spatial regularization method based on unified abundances and an efficient solving strategy. To improve robustness in complex scenarios, we extend the method to its deep counterpart [deep SAGC (DSAGC)], effectively modeling spatial information and nonlinear features. The experimental results show that SAGC achieves performance comparable to state-of-the-art (SOTA) methods on three HS–LiDAR datasets, while DSAGC delivers significant improvements. Code is available at: https://github.com/Liujehong/SAGC
Many neurology-related vision disorders, such as cortical visual impairment (CVI), hemianopia, and visual agnosia, pose diagnostic challenges because of their varied manifestations across the structural, functional, and vascular domains of the brain. Conventional clinical practice relies solely on single imaging modalities, which often fail to capture the complex connections between these dimensions. This paper therefore proposes a pattern-driven AI framework that synthesizes four main brain imaging modalities (MRI, fMRI, DTI, and MRA) into a unified diagnostic pipeline. The proposed system addresses the key barriers to effective integration, including modality heterogeneity, spatial and temporal misalignment, missing modalities, and the complexity of fusion strategy design. Through modality-specific pre-processing, normalized feature representation, and attention-based fusion, the framework captures clinically relevant patterns while maintaining interpretability. Evaluated on benchmark and simulated datasets, the system shows improved diagnostic accuracy and robustness, offering a promising direction for early vision defect detection.
Precipitable water vapor (PWV) is a crucial atmospheric variable that influences weather systems, climate variability, and hydrological processes. Accurate PWV estimation is essential for improving numerical weather prediction, climate modeling, and remote-sensing applications. However, existing methods often rely on extensive meteorological inputs or computationally intensive architectures, limiting their applicability in data-sparse regions. This study introduces a novel hybrid framework, EMMA–NN–BiGRU–XGBoost, designed to forecast monthly mean PWV across Turkey using only four physically meaningful inputs: latitude, longitude, altitude, and seasonal indicators. The framework integrates an enhanced multimodal attention (EMMA) mechanism that disentangles spatial, altitudinal, and seasonal influences, improving interpretability and physical consistency. Bidirectional gated recurrent units (BiGRU) capture temporal dependencies, and XGBoost models nonlinear feature interactions within a weighted stacking ensemble. Hyperparameters are optimized via particle swarm optimization and Bayesian optimization, with particle swarm optimization demonstrating superior tuning efficiency. Extensive benchmarking against traditional machine-learning models, using grid search and random search with fivefold cross-validation, as well as deep-learning baselines, demonstrates significant improvements in predictive accuracy, achieving an R² of 0.92 and a 15%–20% reduction in error compared with state-of-the-art methods. The model also exhibits robustness across diverse climatic zones in Turkey. Shapley additive explanations further elucidate feature importance, aligning model outputs with climatological principles. Beyond methodological advances, this work provides a scalable, interpretable, and data-efficient baseline for PWV forecasting, thereby facilitating enhanced climate diagnostics, hydrological risk assessments, and early warning systems, particularly in regions with limited meteorological observations.
Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead. Specifically, imagined visual tokens are encoded into a compact latent space using a pretrained Visual AutoRegressor (VAR) during CoT reasoning training, and the model jointly learns from textual, visual, and multimodal CoT modes under a unified multi-CoT strategy. At inference, our model performs direct instruction-to-action mapping while still enjoying reasoning-aware representations. Extensive experiments on LH-VLN show that our approach achieves reasoning-aware yet real-time navigation, improving success rates and efficiency while reducing inference latency by an order of magnitude compared to explicit CoT methods.
Real-time emotion understanding in intelligent learning environments is an increasingly important requirement for classrooms that integrate visual, audio, physiological, and ambient IoT data. However, existing unimodal or low-context affect models struggle with temporal instability, low interpretability, and inefficient deployment. This work addresses robust multimodal emotion prediction across seven affective states from heterogeneous signals under real-time constraints, aiming to deliver accurate, low-latency, and explainable predictions that can run on edge devices. The study proposes SwinET-IoT, a multimodal emotion recognition framework that combines an extended Mask R-CNN with a Swin Transformer backbone, lightweight audio and physiological encoders, IoT telemetry processing, a temporal attention transformer, and cross-modal co-attention. Mask-guided spatial attention ensures that information from critical facial and posture regions reaches the fusion stage, while hierarchical multimodal fusion and temporal-consistency losses increase robustness. Edge-oriented optimization via pruning, distillation, and 8-bit quantization reduces the model footprint with minimal loss of accuracy. Experiments on the CRAFT multimodal classroom dataset show clear improvements over five state-of-the-art baselines (92% accuracy, 0.91 macro-F1, 0.96 AUC, and 42 ms/frame inference latency on an edge device). These results confirm that combining spatially guided multimodal fusion with temporal and IoT-context modeling yields efficient, stable, and interpretable emotion prediction suitable for real-world classroom analytics and adaptive learning systems.
Early diagnosis of temporomandibular disorders is challenging. Particularly, intra-articular temporomandibular joint (TMJ) abnormalities can only be confirmed using magnetic resonance imaging (MRI). This study aimed to develop a comprehensive screening method for MRI-detectable TMJ pathologies. We developed an interpretable deep learning framework that leveraged paired open- and closed-mouth TMJ panoramic radiographs and structured clinical metadata. The architecture integrated anatomically guided attention, multimodal clinical features, and ensemble learning for enhanced diagnostic accuracy and interpretability. Across 1355 patients (2710 joints), the best-performing ensemble framework achieved an area under the curve of 0.86, with a balanced classification of MRI-negative and -positive cases. Gradient-weighted Class Activation Mapping visualizations confirmed a consistent focus on the condylar regions, and ablation studies demonstrated the added value of clinical metadata and spatial attention. In conclusion, our prototype workflow can be useful to triage TMJ patients for MRI referral, thus supporting early detection of TMJ abnormalities and timely interventions.
Vision-and-Language Navigation (VLN) in large-scale urban environments requires embodied agents to ground linguistic instructions in complex scenes and recall relevant experiences over extended time horizons. Prior modular pipelines offer interpretability but lack unified memory, while end-to-end (M)LLM agents excel at fusing vision and language yet remain constrained by fixed context windows and implicit spatial reasoning. We introduce \textbf{Mem4Nav}, a hierarchical spatial-cognition long-short memory system that can augment any VLN backbone. Mem4Nav fuses a sparse octree for fine-grained voxel indexing with a semantic topology graph for high-level landmark connectivity, storing both in trainable memory tokens embedded via a reversible Transformer. Long-term memory (LTM) compresses and retains historical observations at both octree and graph nodes, while short-term memory (STM) caches recent multimodal entries in relative coordinates for real-time obstacle avoidance and local planning. At each step, STM retrieval sharply prunes dynamic context, and, when deeper history is needed, LTM tokens are decoded losslessly to reconstruct past embeddings. Evaluated on Touchdown and Map2Seq across three backbones (modular, state-of-the-art VLN with prompt-based LLM, and state-of-the-art VLN with strided-attention MLLM), Mem4Nav yields 7-13 pp gains in Task Completion, sufficient SPD reduction, and>10 pp nDTW improvement. Ablations confirm the indispensability of both the hierarchical map and dual memory modules. Our codes are open-sourced via https://github.com/tsinghua-fib-lab/Mem4Nav.
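As a toy illustration of the fine-grained spatial indexing that such a hierarchical memory relies on, the sketch below quantizes positions into voxel keys and stores/retrieves observations by location. The voxel size and the flat dictionary are illustrative simplifications of Mem4Nav's sparse octree and memory tokens, not its actual data structure.

```python
# Voxel-keyed memory: positions map to integer voxel coordinates used as lookup keys.
from collections import defaultdict


def voxel_key(x: float, y: float, z: float, voxel_size: float = 0.5) -> tuple:
    return (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))


class VoxelMemory:
    def __init__(self, voxel_size: float = 0.5):
        self.voxel_size = voxel_size
        self.store = defaultdict(list)  # voxel key -> list of observation entries

    def write(self, position, observation):
        self.store[voxel_key(*position, self.voxel_size)].append(observation)

    def read(self, position):
        return self.store.get(voxel_key(*position, self.voxel_size), [])


if __name__ == "__main__":
    mem = VoxelMemory()
    mem.write((1.2, 0.3, 4.9), "red storefront on the left")
    print(mem.read((1.4, 0.4, 4.8)))  # same voxel -> ['red storefront on the left']
```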
Cross-modal retrieval is a promising technique nowadays to find semantically similar instances in other modalities while a query instance is given from one modality. However, there still exists many challenges for reducing heterogeneous modality gap by embedding label information to discrete hash codes effectively, solving the binary optimization when generating unified hash codes and reducing the discrepancy of data distribution efficiently during common space learning. In order to overcome the above-mentioned challenges, we propose a Collaboratively Semantic alignment and Metric learning for cross-modal Hashing (CSMH) in this paper. Specifically, by a kernelization operation, CSMH first extracts the non-linear data features for each modality, which are projected into a latent subspace to align both marginal and conditional distributions simultaneously. Then, a maximum mean discrepancy-based metric strategy is customized to mitigate the distribution discrepancies among features from different modalities. Finally, semantic information obtained from the label similarity matrix, is further incorporated to embed the latent semantic structure into the discriminant subspace. Experimental results of CSMH and baseline methods on four widely-used datasets show that CSMH outperforms some state-of-the-art hashing baseline methods for cross-modal retrieval on efficiency and precision.
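To make the maximum-mean-discrepancy component concrete, here is a compact sketch of an RBF-kernel MMD between image and text feature batches, which could serve as a loss term pulling the two modality distributions together in a shared subspace. The kernel bandwidth and feature sizes are illustrative assumptions, and this is not CSMH's full objective.

```python
# RBF-kernel maximum mean discrepancy between two feature batches.
import torch


def rbf_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    dists = torch.cdist(x, y) ** 2
    return torch.exp(-dists / (2 * sigma ** 2))


def mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    # Simple (biased) estimator: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    return (rbf_kernel(x, x, sigma).mean() + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())


if __name__ == "__main__":
    img_feats = torch.randn(128, 64)
    txt_feats = torch.randn(128, 64) + 0.5   # deliberately shifted distribution
    print("MMD:", mmd(img_feats, txt_feats).item())
```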
Cross-modal retrieval, as an emerging field within multimedia research, has gained significant attention in recent years. Unsupervised cross-modal hashing methods are attractive due to their ability to capture latent relationships within the data without label supervision and to produce compact hash codes for high search efficiency. However, the text modality exhibits worse representation ability compared with the image modality, leading to weak guidance to construct the joint similarity matrix. Moreover, most unsupervised cross-modal hashing methods are based on pairwise similarities for training, resulting in non-aggregating data distribution in the hash space. In this paper, we propose a novel Vision-guided Text Mining for Unsupervised Cross-modal Hashing via Community Similarity Quantization, termed VTM-UCH. Specifically, we first find the one-to-one correspondence between each word and each vision (image or object) based on the Contrastive Language-Image Pre-training (CLIP) model and compute the text similarities according to the clustering of their corresponding visions. Then, we define the fine-grained object-level image similarities and design the joint similarity matrix based on the text and image similarities. Accordingly, we construct an undirected graph to compute the communities as the pseudo-centers and adjust the pairwise similarities to improve the hash codes distribution. The experimental results on two common datasets verify the accuracy improvements in comparison with state-of-the-art baselines.
Few-shot cross-modal retrieval focuses on learning cross-modal representations with limited training samples, enabling the model to handle unseen classes during inference. Unlike traditional cross-modal retrieval tasks, which assume that both training and testing data share the same class distribution, few-shot retrieval involves data with sparse representations across modalities. Existing methods often fail to adequately model the multi-peak distribution of few-shot cross-modal data, resulting in two main biases in the latent semantic space: intra-modal bias, where sparse samples fail to capture intra-class diversity, and inter-modal bias, where misalignments between image and text distributions exacerbate the semantic gap. These biases hinder retrieval accuracy. To address these issues, we propose a novel method, GCRDP, for few-shot cross-modal retrieval. This approach effectively captures the complex multi-peak distribution of data using a Gaussian Mixture Model (GMM) and incorporates a multi-positive sample contrastive learning mechanism for comprehensive feature modeling. Additionally, we introduce a new strategy for cross-modal semantic alignment, which constrains the relative distances between image and text feature distributions, thereby improving the accuracy of cross-modal representations. We validate our approach through extensive experiments on four benchmark datasets, demonstrating superior performance over six state-of-the-art methods.
Cross-modal retrieval aims to bridge the semantic gap between heterogeneous modalities—such as images and text—by learning a shared embedding space for semantically aligned representation. While recent models have achieved impressive performance using large-scale contrastive pretraining and multimodal transformers, several fundamental challenges remain unresolved. These include the lack of interpretable latent alignment, vulnerability to distribution shifts, and instability in semantic correspondence across tasks and domains. In this paper, we propose a novel contrastive representation learning framework designed to enhance both the robustness and interpretability of cross-modal retrieval. Our method incorporates a hierarchical dual-stream encoder that preserves modality-specific structures while enabling semantic interaction through a conceptaligned projection layer. The model is optimized via a contrastive loss with semanticaware calibration, encouraging consistent feature correspondence across modalities. We provide a rigorous theoretical analysis of the latent projection space, and demonstrate through extensive experiments on MS-COCO, Flickr30K, and RSICD that our approach outperforms strong baselines not only in retrieval accuracy but also in robustness under noise and interpretability via semantic stability selection. The proposed framework is further validated through ablation studies that isolate the contributions of architectural components and training strategies. Our results confirm that semantic disentanglement and hierarchical encoding jointly improve retrieval quality, cross-domain generalization, and feature transparency. The framework offers a scalable and theoretically grounded solution for reliable and explainable multimodal retrieval.
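The symmetric contrastive objective underlying frameworks like the one above can be sketched as follows: image and text embeddings are L2-normalized and matched pairs are pulled together with a temperature-scaled cross-entropy in both retrieval directions. The temperature and dimensions are illustrative assumptions; the paper's semantic-aware calibration is not reproduced here.

```python
# Symmetric (image-to-text and text-to-image) contrastive loss over a batch of pairs.
import torch
import torch.nn.functional as F


def symmetric_contrastive_loss(img: torch.Tensor, txt: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # img, txt: (batch, dim); row i of img corresponds to row i of txt.
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    loss = symmetric_contrastive_loss(torch.randn(32, 256), torch.randn(32, 256))
    print(loss.item())
```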
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space. Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance, making it challenging to utilize prevalent datasets with missing modalities. We provide theoretical insights into this issue from an anchor shift perspective. Observed modalities are aligned with a local anchor that deviates from the optimal one when all modalities are present, resulting in an inevitable shift. To address this, we propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities. Specifically, CalMRL leverages the priors and the inherent connections among modalities to model the imputation for the missing ones at the representation level. To resolve the optimization dilemma, we employ a bi-step learning method with the closed-form solution of the posterior distribution of shared latents. We validate its mitigation of anchor shift and convergence with theoretical guidance. By equipping the calibrated alignment with the existing advanced method, we offer new flexibility to absorb data with missing modalities, which is originally unattainable. Extensive experiments and comprehensive analyses demonstrate the superiority of CalMRL. Our code, model checkpoints, and evaluation raw data will be publicly available.
In this paper, we address the novel task of egocentric modality generalization action recognition, which aims to learn a unified discrete representation from paired multimodal egocentric action data during pre-training. This approach enables cross-modal zero-shot generalization in downstream tasks, where the modalities available during inference and training are disjoint. While recent efforts have focused on aligning instance-level or temporal features to reduce feature distribution discrepancies across modalities, they have overlooked the inherent structural categorization within action data. To address this limitation, we propose Modal-Agnostic Prototype Learning (MAPLE), a framework that leverages a prototype memory bank to capture categorical structures. This is further enhanced by a robust semantic disentanglement module and a moment aggregation mechanism, enabling semantically similar behaviors to cluster more closely in the latent space and promoting robust cross-modal generalization. Extensive experiments on the Ego4D and WEAR datasets demonstrate that MAPLE significantly outperforms existing approaches, marking a substantial advancement in the field of egocentric action recognition.
This manuscript presents a pioneering Synergistic Neural Agents Network (SynerNet) framework designed to mitigate the phenomenon of cross-modal alignment degeneration in Vision-Language Models (VLMs) when encountering Out-of-Distribution (OOD) concepts. Specifically, four specialized computational units - visual perception, linguistic context, nominal embedding, and global coordination - collaboratively rectify modality disparities via a structured message-propagation protocol. The principal contributions encompass a multi-agent latent space nomenclature acquisition framework, a semantic context-interchange algorithm for enhanced few-shot adaptation, and an adaptive dynamic equilibrium mechanism. Empirical evaluations conducted on the VISTA-Beyond benchmark demonstrate that SynerNet yields substantial performance augmentations in both few-shot and zero-shot scenarios, exhibiting precision improvements ranging from 1.2% to 5.4% across a diverse array of domains.
Abstract. Text-to-image person re-identification (TIReID) is a significant challenge in the cross-modal community, focused on retrieving individuals based on textual queries. The primary obstacle is effectively mapping visual and textual modalities into a shared latent space, a problem inadequately addressed by previous methods that relied on separately pre-trained unimodal models, which often lack the necessary alignment capabilities. Recently, contrastive language-image pre-training (CLIP) has emerged as a versatile large-scale cross-modal visual-language pre-training model, excelling in various cross-modal downstream tasks due to its powerful semantic learning capabilities. CLIP has successfully addressed the need for manual alignment of body part features required by earlier methods. However, in TIReID, integrating CLIP with ReID presents notable alignment issues. It struggles with filtering task-specific irrelevant information, leading to redundancy and interference. In addition, CLIP lacks effective internal alignment, such as misaligned body parts and semantic misalignment. Finally, the joint loss function integrates identification loss, image-text contrastive loss, and mask-based unsupervised training to enhance feature alignment. This new loss structure effectively reduces the risk of over-alignment, ensuring a more balanced training process. Experimental results on CUHK-PEDES benchmarks demonstrate cross-modal latent feature alignment’s effectiveness, surpassing state-of-the-art methods with improvements in rank 1 accuracy by 1.54%.
Industrial anomaly detection (IAD) increasingly benefits from integrating 2D and 3D data, but robust cross-modal fusion remains challenging. We propose a novel unsupervised framework, Multi-Modal Attention-Driven Fusion Restoration (MAFR), which synthesises a unified latent space from RGB images and point clouds using a shared fusion encoder, followed by attention-guided, modality-specific decoders. Anomalies are localised by measuring reconstruction errors between input features and their restored counterparts. Evaluations on the MVTec 3D-AD and Eyecandies benchmarks demonstrate that MAFR achieves state-of-the-art results, with a mean I-AUROC of 0.972 and 0.901, respectively. The framework also exhibits strong performance in few-shot learning settings, and ablation studies confirm the critical roles of the fusion architecture and composite loss. MAFR offers a principled approach for fusing visual and geometric information, advancing the robustness and accuracy of industrial anomaly detection. Code is available at https://github.com/adabrh/MAFR.
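A small sketch of the reconstruction-error localization step described above: each location is scored by the distance between input features and their restored counterparts, fused across an RGB and a point-cloud feature map. Shapes and the fusion weight are illustrative assumptions rather than MAFR's exact scoring.

```python
# Per-location reconstruction error fused across two modalities into one anomaly map.
import torch


def anomaly_map(rgb_feat, rgb_recon, pc_feat, pc_recon, alpha: float = 0.5):
    # Each tensor: (channels, H, W). Per-pixel L2 error, fused across modalities.
    rgb_err = (rgb_feat - rgb_recon).pow(2).sum(dim=0).sqrt()
    pc_err = (pc_feat - pc_recon).pow(2).sum(dim=0).sqrt()
    return alpha * rgb_err + (1 - alpha) * pc_err  # (H, W) anomaly score map


if __name__ == "__main__":
    c, h, w = 64, 28, 28
    amap = anomaly_map(torch.randn(c, h, w), torch.randn(c, h, w),
                       torch.randn(c, h, w), torch.randn(c, h, w))
    print(amap.shape, amap.max().item())
```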
Despite the fast progress of deep learning, one standing challenge is the gap between the observed training samples and the underlying true distribution. This gap arises for multiple reasons, e.g., sampling bias and noise. In the era of foundation models, we show that when leveraging off-the-shelf (vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the geometric shapes of the resulting feature distributions exhibit remarkable transferability across domains and datasets. To verify its practical usefulness, we embody our geometric knowledge-guided distribution calibration framework in two popular and challenging settings: federated learning and long-tailed recognition. In the federated setting, we devise a technique for acquiring the global geometric shape under privacy constraints, then leverage this knowledge to generate new samples for clients, with the aim of bridging the gap between local and global observations. In long-tailed learning, it utilizes the geometric knowledge transferred from sample-rich categories to recover the true distribution for sample-scarce tail classes. Comprehensive experiments show that the proposed geometric knowledge-guided distribution calibration effectively overcomes information deficits caused by data heterogeneity and sample imbalance, with boosted performance across benchmarks.
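For the long-tailed setting, the core idea of transferring the geometric shape of sample-rich classes to sample-scarce ones can be illustrated with a small NumPy sketch: keep the tail-class mean, borrow the covariance from the nearest head classes, and draw synthetic features from the calibrated Gaussian. The function name, the top-k neighbor rule, and the shrinkage term are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

def calibrate_tail_class(tail_feats, head_feats_by_class, top_k=2, alpha=0.5, n_samples=100):
    """Minimal sketch of geometry-guided calibration for a sample-scarce class.

    tail_feats: (n, D) features of the tail class; head_feats_by_class: dict mapping
    a head-class id to its (N_c, D) feature matrix. All names/parameters are assumptions.
    """
    tail_mean = tail_feats.mean(axis=0)

    # Rank head classes by similarity of their means to the tail mean
    head_means = {c: f.mean(axis=0) for c, f in head_feats_by_class.items()}
    ranked = sorted(head_means, key=lambda c: np.linalg.norm(head_means[c] - tail_mean))

    # Borrow the "geometric shape": average covariance of the top-k nearest head classes,
    # with a small diagonal shrinkage term for numerical stability
    covs = [np.cov(head_feats_by_class[c], rowvar=False) for c in ranked[:top_k]]
    calibrated_cov = np.mean(covs, axis=0) + alpha * np.eye(tail_mean.shape[0])

    # Draw synthetic tail-class features from the calibrated Gaussian
    return np.random.multivariate_normal(tail_mean, calibrated_cov, size=n_samples)
```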
This paper introduces an optical-to-radar cross-modal framework, denominated AORC, for generating full-attitude, high-fidelity Inverse Synthetic Aperture Radar (ISAR) samples of space targets; the generated ISAR images exhibit strong physical fidelity and satisfactory textural representation of target components, even when the input optical samples are sparse. Specifically, the attitude encoding module (AEM) assimilates prior knowledge of analogous targets across different attitudes through a carefully designed NeRF-based encoder, deriving encoded features in the latent space. These comprehensive attitude features are then fed into the modality transformation module (MTM), which applies a Brownian-Bridge-based diffusion process to transform each attitude-specific feature from the optical to the ISAR modality. Extensive simulations on satellite targets validate the effectiveness of the proposed approach.
Dexterous manipulation is a cornerstone capability for robotic systems aiming to interact with the physical world in a human-like manner. Although vision-based methods have advanced rapidly, tactile sensing remains crucial for fine-grained control, particularly in unstructured or visually occluded settings. We present ViTacFormer, a representation-learning approach that couples a cross-attention encoder to fuse high-resolution vision and touch with an autoregressive tactile prediction head that anticipates future contact signals. Building on this architecture, we devise an easy-to-challenging curriculum that steadily refines the visual-tactile latent space, boosting both accuracy and robustness. The learned cross-modal representation drives imitation learning for multi-fingered hands, enabling precise and adaptive manipulation. Across a suite of challenging real-world benchmarks, our method achieves approximately 50% higher success rates than prior state-of-the-art systems. To our knowledge, it is also the first to autonomously complete long-horizon dexterous manipulation tasks that demand highly precise control with an anthropomorphic hand, successfully executing up to 11 sequential stages and sustaining continuous operation for 2.5 minutes.
The emergence of multimodal large language models has redefined the agent paradigm by integrating language and vision modalities with external data sources, enabling agents to better interpret human instructions and execute increasingly complex tasks. However, in this paper, we identify a critical yet previously overlooked security vulnerability in multimodal agents: cross-modal prompt injection attacks. To exploit this vulnerability, we propose CrossInject, a novel attack framework in which an attacker embeds adversarial perturbations across multiple modalities to align with target malicious content, allowing external instructions to hijack the agents' decision-making process and execute unauthorized tasks. Our approach incorporates two coordinated components. First, we introduce Visual Latent Alignment, which optimizes adversarial features toward the malicious instructions in the visual embedding space using a text-to-image generative model, ensuring that adversarial images subtly encode cues for malicious task execution. Second, we present Textual Guidance Enhancement, in which a large language model is leveraged to construct the black-box defensive system prompt through adversarial meta-prompting and to generate a malicious textual command that steers the agents' output toward better compliance with the attacker's requests. Extensive experiments demonstrate that our method outperforms state-of-the-art attacks, achieving at least a +30.1% increase in attack success rates across diverse tasks. Furthermore, we validate the attack's effectiveness on real-world multimodal autonomous agents, highlighting its potential implications for safety-critical applications. Code can be found at https://github.com/Larry0454/CrossInject.
The lack of suitable evaluation metrics hinders the precise measurement of biases in the cross-modal feature space and the distinctiveness of 3D point cloud features, impeding further optimization efforts for enhanced 3D understanding. To tackle these challenges, we present a unified distribution-similarity-coefficient-driven multimodal pre-training framework for 3D understanding, termed CAUD-3D, providing a deeper understanding of the cross-modal alignment and uni-modal disentanglement processes during multimodal pre-training. Specifically, we generalize class-wise features to a Gaussian distribution, facilitating the quantification of representation quality within the hyper-sphere space through the calculation of the distribution similarity coefficient. To the best of our knowledge, this is the first work to measure the representation quality of cross-modal features from the perspective of the distribution similarity coefficient. Furthermore, we formulate cross-modal class-wise alignment and uni-modal class-wise discrepancy loss terms to align cross-modal class-wise feature distributions and disentangle interference among the 3D class-wise feature distributions. Our method significantly outperforms previous works.
In recent years, visual question answering (VQA) has become widely used in multimodal domains, yet the multimodal semantic gap and data distribution bias between modalities reduce model generalization, limiting improvements in comprehension and reasoning performance. To address these issues, we propose the Contrastive Clustering Algorithm (CCA), a coordinated framework that integrates contrastive learning with a clustering algorithm. CCA utilizes contrastive loss functions to construct cross-modal positive and negative sample pairs, enabling effective mining and alignment of semantic information between different modalities. It is combined with a clustering algorithm to partition the feature space at a fine-grained level, reducing intra-cluster differences and increasing inter-cluster separation for more discriminative feature representations. Extensive experiments on the VQA v2 dataset show that CCA significantly enhances cross-modal comprehension and reasoning ability, providing an effective approach and new strategies for mitigating semantic and distributional bias in VQA.
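A hedged sketch of the two ingredients named in this abstract (cross-modal contrastive pairs plus a clustering term that tightens intra-cluster distances) is given below; the joint-feature construction, weighting, and centroid handling are illustrative assumptions rather than CCA's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_clustering_loss(vis_feat, txt_feat, cluster_centers, temperature=0.1):
    """Sketch combining a cross-modal contrastive term with a clustering term.

    vis_feat, txt_feat: (B, D) paired visual/question embeddings.
    cluster_centers: (K, D) current cluster centroids (e.g., from k-means on the
    joint features). Names and the unit weighting are assumptions.
    """
    v = F.normalize(vis_feat, dim=-1)
    t = F.normalize(txt_feat, dim=-1)

    # Cross-modal contrastive term: matched (v_i, t_i) pairs are positives,
    # all other pairings in the batch are negatives.
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    contrastive = F.cross_entropy(logits, targets)

    # Clustering term: pull each joint feature toward its nearest centroid,
    # reducing intra-cluster differences.
    joint = F.normalize(v + t, dim=-1)
    centers = F.normalize(cluster_centers, dim=-1)
    nearest = (joint @ centers.t()).argmax(dim=1)                 # (B,)
    compactness = (joint - centers[nearest]).pow(2).sum(dim=1).mean()

    return contrastive + compactness
```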
Multimodal representation learning aims to capture both shared and complementary semantic information across multiple modalities. However, the intrinsic heterogeneity of diverse modalities presents substantial challenges for effective cross-modal collaboration and integration. To address this, we introduce DecAlign, a novel hierarchical cross-modal alignment framework designed to decouple multimodal representations into modality-unique (heterogeneous) and modality-common (homogeneous) features. For handling heterogeneity, we employ a prototype-guided optimal transport alignment strategy leveraging Gaussian mixture modeling and multi-marginal transport plans, thus mitigating distribution discrepancies while preserving modality-unique characteristics. To reinforce homogeneity, we ensure semantic consistency across modalities by matching latent distributions with Maximum Mean Discrepancy (MMD) regularization. Furthermore, we incorporate a multimodal transformer to enhance high-level semantic feature fusion, further reducing cross-modal inconsistencies. Our extensive experiments on four widely used multimodal benchmarks demonstrate that DecAlign consistently outperforms existing state-of-the-art methods across five metrics. These results highlight the efficacy of DecAlign in achieving superior cross-modal alignment and semantic consistency while preserving modality-unique features, marking a significant advancement in multimodal representation learning. Our project page is at https://taco-group.github.io/DecAlign.
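The MMD regularizer used for the homogeneous (modality-common) features can be written compactly with an RBF kernel. The sketch below uses a single fixed bandwidth and the biased estimator, both simplifying assumptions.

```python
import torch

def mmd_rbf(x, y, sigma=1.0):
    """RBF-kernel Maximum Mean Discrepancy between two feature sets (biased estimator).

    x: (N, D) modality-common features from one modality; y: (M, D) from another.
    A minimal sketch of the MMD regularizer described above; the single fixed
    bandwidth is a simplifying assumption.
    """
    def rbf(a, b):
        d2 = torch.cdist(a, b).pow(2)            # pairwise squared distances
        return torch.exp(-d2 / (2 * sigma ** 2))

    return rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean()
```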
We present Genesis, a unified framework for joint generation of multi-view driving videos and LiDAR sequences with spatio-temporal and cross-modal consistency. Genesis employs a two-stage architecture that integrates a DiT-based video diffusion model with 3D-VAE encoding, and a BEV-aware LiDAR generator with NeRF-based rendering and adaptive sampling. Both modalities are directly coupled through a shared latent space, enabling coherent evolution across visual and geometric domains. To guide the generation with structured semantics, we introduce DataCrafter, a captioning module built on vision-language models that provides scene-level and instance-level supervision. Extensive experiments on the nuScenes benchmark demonstrate that Genesis achieves state-of-the-art performance across video and LiDAR metrics (FVD 16.95, FID 4.24, Chamfer 0.611), and benefits downstream tasks including segmentation and 3D detection, validating the semantic fidelity and practical utility of the generated data.
Due to the lack of effective cross-modal modeling, existing open-source audio-video generation methods often exhibit compromised lip synchronization and insufficient semantic consistency. To mitigate these drawbacks, we propose UniAVGen, a unified framework for joint audio and video generation. UniAVGen is anchored in a dual-branch joint synthesis architecture, incorporating two parallel Diffusion Transformers (DiTs) to build a cohesive cross-modal latent space. At its heart lies an Asymmetric Cross-Modal Interaction mechanism, which enables bidirectional, temporally aligned cross-attention, thus ensuring precise spatiotemporal synchronization and semantic consistency. Furthermore, this cross-modal interaction is augmented by a Face-Aware Modulation module, which dynamically prioritizes salient regions in the interaction process. To enhance generative fidelity during inference, we additionally introduce Modality-Aware Classifier-Free Guidance, a novel strategy that explicitly amplifies cross-modal correlation signals. Notably, UniAVGen's robust joint synthesis design enables seamless unification of pivotal audio-video tasks within a single model, such as joint audio-video generation and continuation, video-to-audio dubbing, and audio-driven video synthesis. Comprehensive experiments validate that, with far fewer training samples (1.3M vs. 30.1M), UniAVGen delivers overall advantages in audio-video synchronization, timbre consistency, and emotion consistency.
Many reasoning techniques for large multimodal models adapt language model approaches, such as Chain-of-Thought (CoT) prompting, which express reasoning as word sequences. While effective for text, these methods are suboptimal for multimodal contexts, struggling to align audio, visual, and textual information dynamically. To explore an alternative paradigm, we propose the Multimodal Chain of Continuous Thought (MCOUT), which enables reasoning directly in a joint latent space rather than in natural language. In MCOUT, the reasoning state is represented as a continuous hidden vector, iteratively refined and aligned with visual and textual embeddings, inspired by human reflective cognition. We develop two variants: MCOUT-Base, which reuses the language model's last hidden state as the continuous thought for iterative reasoning, and MCOUT-Multi, which integrates multimodal latent attention to strengthen cross-modal alignment between visual and textual features. Experiments on benchmarks including MMMU, ScienceQA, and MMStar show that MCOUT consistently improves multimodal reasoning, yielding up to 8.23% accuracy gains over strong baselines and improving BLEU scores by up to 8.27% across multiple-choice and open-ended tasks. These findings highlight latent continuous reasoning as a promising direction for advancing LMMs beyond language-bound CoT, offering a scalable framework for human-like reflective multimodal inference. Code is available at https://github.com/Hanhpt23/OmniMod.
Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion
Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective in distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially in short-video platforms, yet most pretrained multimodal architectures primarily focus on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pretrained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency, achieving an up to 9% recall improvement at 80% precision on proprietary datasets. Additionally, we introduce Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models, yielding up to another 9% recall improvement at 80% precision. Our methods are successfully deployed in multiple production systems, leading to significant business gains through online A/B experiments.
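The abstract does not spell out the LSB criterion, so the sketch below shows one plausible kNN-in-latent-space scoring rule for active learning: prioritize unlabeled items whose nearest labeled neighbors disagree with the model's prediction, which can surface overconfident misclassifications. All names and the scoring rule itself are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lsb_candidate_scores(unlabeled_emb, unlabeled_pred, labeled_emb, labeled_y, k=10):
    """Hedged sketch of a kNN-style latent-space criterion for active learning.

    unlabeled_emb: (U, D) embeddings of unlabeled items; unlabeled_pred: (U,)
    predicted classes; labeled_emb: (L, D); labeled_y: (L,) ground-truth classes.
    The exact scoring rule in the paper is not public here; this is illustrative.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(labeled_emb)
    _, idx = nn.kneighbors(unlabeled_emb)                 # (U, k) neighbor indices
    neighbor_labels = labeled_y[idx]                      # (U, k)

    # Fraction of labeled neighbors whose class differs from the model's prediction;
    # higher scores suggest items worth sending to annotators.
    disagreement = (neighbor_labels != unlabeled_pred[:, None]).mean(axis=1)
    return disagreement
```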
The field of cross-modal retrieval aims to construct a shared representation space for samples from multiple modalities, typically within the vision and language domains. Deep hashing, with its high computational efficiency and low storage costs, has emerged as a central focus in this field and has garnered significant attention in recent research. However, current hash retrieval, which concentrates on deterministic methods, struggles to effectively capture semantically ambiguous correspondences between cross-modal samples, where heterogeneous data exhibit complex many-to-many semantic relationships in the latent space. To address this limitation, we propose a novel Deep Probabilistic Binary Embedding (DPBE) framework, designed to generate discriminative, modality-invariant hash codes that facilitate accurate and reliable cross-modal retrieval. In contrast to contemporary probabilistic methods, we focus on optimizing hash networks to learn more accurate binary embeddings by using the learning mode of probabilistic embeddings. We introduce the first Bayesian encoder for hash learning, which employs Laplace Approximation to model a distribution over network weights. Extensive experimental results demonstrate that our approach not only outperforms deterministic methods in retrieval performance but also provides uncertainty estimates, enhancing the interpretability of the embeddings. The corresponding code is available at https://github.com/QinLab-WFU/DPBE.
Cross-modal retrieval requires building a common latent space that captures and correlates information from different data modalities, usually images and texts. Cross-modal training based on the triplet loss with hard negative mining is a state-of-the-art technique to address this problem. This paper shows that such an approach is not always effective in handling intra-modal similarities. Specifically, we found that this method can lead to inconsistent similarity orderings in the latent space, where intra-modal pairs with unknown ground-truth similarity are ranked higher than cross-modal pairs representing the same concept. To address this problem, we propose two novel loss functions that leverage intra-modal similarity constraints available in a training triplet but not used by the original formulation. Additionally, this paper explores the application of this framework to unsupervised image retrieval problems, where cross-modal training can provide the supervisory signals that are otherwise missing in the absence of category labels. To our knowledge, we are the first to evaluate cross-modal training for intra-modal retrieval without labels. We present comprehensive experiments on MS-COCO and Flickr30k, demonstrating the advantages and limitations of the proposed methods in cross-modal and intra-modal retrieval tasks in terms of performance and novelty measures. We also conduct a case study on the ROCO dataset to assess the performance of our method on medical images and present an ablation study on one of our approaches to understand the impact of the different components of the proposed loss function. Our code is publicly available on GitHub https://github.com/MariodotR/FullHN.git.
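The sketch below illustrates hard-negative triplet mining over a batch together with one plausible intra-modal ordering constraint of the kind argued for above; the specific intra-modal term is an assumption, not the authors' exact loss.

```python
import torch
import torch.nn.functional as F

def triplet_with_intra_modal_constraint(img, txt, margin=0.2, intra_margin=0.1):
    """Sketch of hard-negative triplet loss plus an intra-modal ordering constraint.

    img, txt: (B, D) embeddings of matched image-text pairs (shapes are assumptions).
    """
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    sim = img @ txt.t()                                   # cross-modal similarities
    pos = sim.diag()                                      # matched pairs

    # Hard negative mining: hardest non-matching text per image and vice versa
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_txt = sim.masked_fill(mask, -1).max(dim=1).values
    hardest_img = sim.masked_fill(mask, -1).max(dim=0).values
    triplet = (F.relu(margin + hardest_txt - pos) +
               F.relu(margin + hardest_img - pos)).mean()

    # Intra-modal constraint (illustrative): similarity of an image to any other
    # image should not exceed its similarity to its own caption by a margin.
    intra_img = (img @ img.t()).masked_fill(mask, -1).max(dim=1).values
    intra = F.relu(intra_margin + intra_img - pos).mean()

    return triplet + intra
```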
Hybrid motor imagery brain-computer interfaces (MI-BCIs), which integrate both electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) signals, outperform those based solely on EEG. However, simultaneously recording EEG and fNIRS signals is highly challenging due to the difficulty of colocating both types of sensors on the same scalp surface. This physical constraint complicates the acquisition of high-quality hybrid signals, thereby limiting the widespread application of hybrid MI-BCIs. To address this issue, this study proposes the spatio-temporal controlled diffusion model (SCDM) as a framework for cross-modal generation from EEG to fNIRS. The model utilizes two core modules, the spatial cross-modal generation (SCG) module and the multi-scale temporal representation (MTR) module, which adaptively learn the respective latent temporal and spatial representations of both signals in a unified representation space. The SCG module further maps EEG representations to fNIRS representations by leveraging their spatial relationships. Experimental results show high similarity between synthetic and real fNIRS signals. The joint classification performance of EEG and synthetic fNIRS signals is comparable to or even better than that of EEG with real fNIRS signals. Furthermore, the synthetic signals exhibit similar spatio-temporal features to real signals while preserving spatial relationships with EEG signals. To our knowledge, this is the first work to propose an end-to-end framework for cross-modal generation from EEG to fNIRS. Experimental results suggest that the SCDM may represent a promising paradigm for the acquisition of hybrid EEG-fNIRS signals in MI-BCI systems.
Current research on adversarial attacks mainly focuses on RGB trackers, with no existing methods for attacking RGB-T cross-modal trackers. To fill this gap and overcome its challenges, we propose a progressive adversarial patch generation framework and achieve cross-modal stealth. On the one hand, we design a coarse-to-fine architecture grounded in the latent space to progressively and precisely uncover the vulnerabilities of RGB-T trackers. On the other hand, we introduce a correlation-breaking loss that disrupts the modal coupling within trackers, spanning from the pixel to the semantic level. These two design elements ensure that the proposed method can overcome the obstacles posed by cross-modal information complementarity in implementing attacks. Furthermore, to enhance the reliable application of the adversarial patches in real world, we develop a point tracking-based reprojection strategy that effectively mitigates performance degradation caused by multi-angle distortion during imaging. Extensive experiments demonstrate the superiority of our method.
With 5G and IoT booming, explosive multimodal data growth challenges communication bandwidth and retrieval accuracy. Cross-modal hashing stands out in cross-modal retrieval tasks owing to the low storage requirements and swift retrieval advantages of binary encoding. However, existing methods often rely on coarse-grained semantics and single supervision information, ignoring the impact of fine-grained semantics and joint supervision. To handle these dilemmas, this paper proposes Multi-Semantic Embedding Hashing (MSEH) for large-scale cross-modal retrieval. First, modality-specific representations are learned to explore modality-private semantic information. Then, fine-grained semantic information is mined through multimodal latent space learning and semantic center learning. Finally, multiple semantics are embedded, and hash code learning is jointly supervised. Extensive empirical tests on three classic datasets demonstrate that MSEH outperforms six state-of-the-art methods.
Deep learning (DL)-based pansharpening has been widely applied in high-resolution imaging. Yet, artifacts related to generalization and oversmoothing have remained a persistent challenge, primarily due to the mismatch between the simulation dataset and unseen real-world scenarios. Current approaches address these through unsupervised frameworks or generative models, while modal inconsistency is not fully considered, leading to suboptimal performance. In this article, we propose a contrastive cross-modal framework via uncertainty guidance (UGCC), which comprises three key modules: a contrast feature enhancement module (CFEM), a cross-modal compensation module (CMCM), and an uncertainty guidance module (UGM). First, to enhance generalization and reduce overfitting, CFEM is introduced. Robust contrast features are augmented and learned sparsely in the latent space, where sample distributions are refined and redundant information is filtered from highly similar sample pairs for enhanced training stability. Furthermore, CMCM mitigates modal inconsistency effectively through domain transfer and collaborative attention, achieving efficient modal separation and interaction. Finally, to adaptively balance the performance of CMCM and CFEM based on prediction confidence, a hybrid loss function is designed, where UGM adjusts the weights by quantifying statistical-versus-structural uncertainties. Extensive experiments on QuickBird, Gaofen-2, WorldView-2, and WorldView-3 demonstrate that the performance of the proposed method surpasses or matches the state of the art. Furthermore, ablation studies validate the effectiveness of each component. The code is now available at: https://github.com/meimeizeng/UGCF.
Video retrieval requires aligning visual content with corresponding natural language descriptions. In this paper, we introduce Modality Auxiliary Concepts for Video Retrieval (MAC-VR), a novel approach that leverages modality-specific tags -- automatically extracted from foundation models -- to enhance video retrieval. We propose to align modalities in a latent space, along with learning and aligning auxiliary latent concepts derived from the features of a video and its corresponding caption. We introduce these auxiliary concepts to improve the alignment of visual and textual latent concepts, allowing concepts to be distinguished from one another. We conduct extensive experiments on six diverse datasets: two different splits of MSR-VTT, DiDeMo, TGIF, Charades and YouCook2. The experimental results consistently demonstrate that modality-specific tags improve cross-modal alignment, outperforming current state-of-the-art methods across three datasets and performing comparably or better across others. Project Webpage: https://adrianofragomeni.github.io/MAC-VR/
Recent advances in diffusion models have enhanced multimodal-guided visual generation, enabling customized subject insertion that seamlessly “brushes” user-specified objects into a given image guided by textual prompts. However, existing methods often struggle to insert customized subjects with high fidelity and align results with the user’s intent through textual prompts. In this work, we propose In-Context Brush, a zero-shot framework for customized subject insertion by reformulating the task within the paradigm of in-context learning. Without loss of generality, we formulate the object image and the textual prompts as cross-modal demonstrations, and the target image with the masked region as the query. The goal is to inpaint the target image with the subject aligning textual prompts without model tuning. Building upon a pretrained MMDiT-based inpainting network, we perform test-time enhancement via dual-level latent space manipulation: intra-head latent feature shifting within each attention head that dynamically shifts attention outputs to reflect the desired subject semantics and inter-head attention reweighting across different heads that amplifies prompt controllability through differential attention prioritization. Extensive experiments and applications demonstrate that our approach achieves superior identity preservation, text alignment, and image quality compared to existing state-of-the-art methods, without requiring dedicated training or additional data collection. Project page: https://yuci-gpt.github.io/In-Context-Brush/.
Climate change is intensifying wildfire risks globally, making reliable forecasting critical for adaptation strategies. While machine learning shows promise for wildfire prediction from Earth observation data, current approaches lack uncertainty quantification essential for risk-aware decision making. We present the first systematic analysis of spatial uncertainty in wildfire spread forecasting using multimodal Earth observation inputs. We demonstrate that predictive uncertainty exhibits coherent spatial structure concentrated near fire perimeters. Our novel distance metric reveals high-uncertainty regions form consistent 20-60 meter buffer zones around predicted firelines - directly applicable for emergency planning. Feature attribution identifies vegetation health and fire activity as primary uncertainty drivers. This work enables more robust wildfire management systems supporting communities adapting to increasing fire risk under climate change.
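The distance-based summary described above can be approximated with a Euclidean distance transform around the predicted fire perimeter, binning per-pixel uncertainty by distance. Pixel size and bin edges in the sketch below are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def uncertainty_distance_profile(pred_fire_mask, uncertainty_map, pixel_size_m=10.0,
                                 bins_m=(0, 20, 40, 60, 100, 200)):
    """Hedged sketch: summarize predictive uncertainty as a function of distance
    to the predicted fireline, in the spirit of the distance metric above.

    pred_fire_mask: boolean (H, W) predicted burned area; uncertainty_map: (H, W)
    per-pixel uncertainty. Pixel size and bin edges are illustrative assumptions.
    """
    # Distance (in metres) of every pixel to the predicted fire perimeter
    dist_outside = distance_transform_edt(~pred_fire_mask) * pixel_size_m
    dist_inside = distance_transform_edt(pred_fire_mask) * pixel_size_m
    dist_to_perimeter = np.where(pred_fire_mask, dist_inside, dist_outside)

    # Mean uncertainty per distance band; a peak in the first bands corresponds
    # to a high-uncertainty buffer zone around the fireline.
    profile = []
    for lo, hi in zip(bins_m[:-1], bins_m[1:]):
        band = (dist_to_perimeter >= lo) & (dist_to_perimeter < hi)
        profile.append(uncertainty_map[band].mean() if band.any() else np.nan)
    return profile
```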
Explainable Artificial Intelligence (XAI) is now crucial for ensuring the safe use of deep learning models in hospitals and clinics. This holds especially true for medical imaging, where reliably informed diagnoses are difficult to make when the model's workings are unknown. Although existing interpretability methods, such as saliency maps, attribution techniques, and prototype-based reasoning, give some insight into how models work, they typically address only one type of input at a time. In real clinical work, though, doctors often deal with multimodal imaging such as MRI, CT, and retinal scans, each with its own resolution, contrast, and anatomical detail. This paper therefore presents MedXAI-MM, a unified framework for multi-modality explainability. The framework brings together three main parts: Grad-CAM++ for spatial saliency, DeepSHAP for pixel-level feature attribution, and Case-Based Retrieval (CBR), which adds clinical context through evidence from similar cases. The setup uses a hybrid backbone of ResNet-50 and Swin Transformer, allowing local and global features to be extracted in a complementary way, and attention-based fusion supports modality-aware representation learning. We also introduce a new metric, the Cross-Modality Fidelity Score (CMFS), which measures how consistent explanations remain across imaging types that differ substantially. Experiments on BraTS-MRI, ChestX-ray14, and DRIVE show strong results: MedXAI-MM improves faithfulness by up to 18% and localization IoU by 12%, and clinicians rate its interpretability higher than top baselines. Overall, the findings show how unified multimodal interpretability can bridge the gap between accurate diagnostics and transparency in medicine, pushing AI closer to everyday clinical use.
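The abstract does not define CMFS precisely, so the following is only an illustrative stand-in for a cross-modality explanation-consistency score: mean pairwise cosine similarity between normalized saliency maps of the same case across modalities, assuming the maps are registered to a common grid.

```python
import numpy as np

def explanation_consistency(saliency_maps):
    """Hedged sketch of a cross-modality explanation-consistency score.

    saliency_maps: list of (H, W) saliency/attribution maps for the same case
    obtained from different imaging modalities, already registered to a common
    grid. This is an illustrative stand-in for the CMFS described above, whose
    exact definition is not given in the abstract.
    """
    flats = []
    for m in saliency_maps:
        m = (m - m.min()) / (m.max() - m.min() + 1e-8)   # min-max normalize to [0, 1]
        flats.append(m.ravel())

    sims = []
    for i in range(len(flats)):
        for j in range(i + 1, len(flats)):
            a, b = flats[i], flats[j]
            sims.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return float(np.mean(sims))
```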
Bipolar disorder (BD) is a debilitating mental illness characterized by significant mood swings, posing a substantial challenge for accurate diagnosis due to its clinical complexity. This paper presents CS2former, a novel approach leveraging a dual channel-spatial feature extraction module within a Transformer model to diagnose BD from resting-state functional MRI (Rs-fMRI) and T1-weighted MRI (T1w-MRI) data. CS2former employs a Channel-2D Spatial Feature Aggregation Module to decouple channel and spatial information from Rs-fMRI, while a Channel-3D Spatial Attention Module with Synchronized Attention Module (SAM) concurrently computes attention for T1w-MRI feature maps. This dual extraction strategy is coupled with a Transformer, enhancing feature integration across modalities. Our experimental results on two datasets, including the OpenfMRI and our collected datasets, demonstrate CS2former's superior performance. Notably, the model achieves a 10.8% higher Balanced Accuracy on our dataset and a 5.7% improvement on the OpenfMRI dataset compared to the baseline models. These results underscore CS2former's innovation in multimodal feature fusion and its potential to elevate the efficiency and accuracy of BD diagnosis.
Multimodal magnetic resonance imaging (MRI) is vital for the precise segmentation of brain tumors. However, missing or incomplete multimodal data is a frequent challenge in clinical practice, significantly affecting segmentation performance. Current advanced methods primarily focus on fusing multimodal images in the spatial domain, often neglecting the interplay between different modalities in the frequency domain. In this work, we propose a novel framework named spatial and frequency feature recalibration Transformer (SFFR-Transformer), which utilizes a frequency and spatial hybrid multihead attention (FSHMA) Transformer. This approach facilitates the complementary fusion of spatial- and frequency-domain information, enhancing the reconstruction of missing modalities. Moreover, most existing methods map fused modalities directly to all segmentation targets. Inspired by the correlation between single modalities and specific subtargets, we introduce a modality-subtarget matching module (MSTM). This module decouples the fusion modalities from the segmentation targets, enabling more accurate mapping between single modalities and their corresponding subtargets. Comprehensive experiments on the publicly available BraTS2018 and BraTS2020 datasets demonstrate that our framework surpasses state-of-the-art methods, particularly in scenarios involving missing modalities.
Enhancing Spatial Reasoning in Multimodal Vision-Language Models via Depth-Aware Feature Integration
Vision-language models such as Contrastive Language-Image Pre-training (CLIP) excel at aligning images and text, yet they still struggle to reason about spatial relationships within a scene. We present a lightweight extension to CLIP that injects a depth modality, fusing RGB and depth features while preserving the model's pre-trained visual-textual knowledge. Trained and evaluated on dedicated spatial-reasoning benchmarks, our depth-enhanced architecture yields substantial accuracy gains over an RGB-only baseline and produces more human-like judgments of relative position and distance. These findings chart a clear path from 2-D image-language models toward true 3-D spatial understanding. Crucially, the compact design retains the rich semantics learned during CLIP's large-scale pre-training, enabling markedly faster convergence while matching the final accuracy achieved by training from scratch.
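A minimal sketch of the depth-injection idea, assuming a frozen CLIP image embedding fused with a small depth encoder via concatenation and an MLP; the dimensions and fusion head are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DepthAwareCLIPHead(nn.Module):
    """Minimal sketch of fusing a frozen CLIP image embedding with a depth embedding.

    `clip_dim` / `depth_dim` and the concatenation-plus-MLP fusion are assumptions
    for illustration; the paper's exact fusion layer is not shown here.
    """
    def __init__(self, clip_dim=512, depth_dim=256, out_dim=512):
        super().__init__()
        self.depth_encoder = nn.Sequential(          # tiny CNN over a depth map
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, depth_dim),
        )
        self.fusion = nn.Sequential(
            nn.Linear(clip_dim + depth_dim, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, clip_image_feat, depth_map):
        # clip_image_feat: (B, clip_dim) from the frozen CLIP visual tower
        # depth_map: (B, 1, H, W) metric or relative depth
        d = self.depth_encoder(depth_map)
        fused = self.fusion(torch.cat([clip_image_feat, d], dim=-1))
        # The fused embedding is compared against CLIP text embeddings as usual
        return fused
```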
Recent advancements in multimodal large language models (MLLMs) have shown promising results, yet existing approaches struggle to effectively handle both temporal and spatial localization simultaneously. This challenge stems from two key issues: first, incorporating spatial-temporal localization introduces a vast number of coordinate combinations, complicating the alignment of linguistic and visual coordinate representations; second, encoding fine-grained temporal and spatial information during video feature compression is inherently difficult. To address these issues, we propose LLaVA-ST, an MLLM for fine-grained spatial-temporal multimodal understanding. In LLaVA-ST, we propose Language-Aligned Positional Embedding, which embeds the textual coordinate special token into the visual space, simplifying the alignment of fine-grained spatial-temporal correspondences. Additionally, we design the Spatial-Temporal Packer, which decouples the feature compression of temporal and spatial resolutions into two distinct point-to-region attention processing streams. Furthermore, we propose the ST-Align dataset with 4.3M training samples for fine-grained spatial-temporal multimodal understanding. With ST-Align, we present a progressive training pipeline that aligns the visual and textual features through sequential coarse-to-fine stages. Additionally, we introduce an ST-Align benchmark to evaluate spatial-temporal interleaved fine-grained understanding tasks, which include Spatial-Temporal Video Grounding (STVG), Event Localization and Captioning (ELC), and Spatial Video Grounding (SVG). LLaVA-ST achieves outstanding performance on 11 benchmarks requiring fine-grained temporal, spatial, or spatial-temporal interleaved multimodal understanding. Our code, data, and benchmark will be released at https://github.com/appletea233/LLaVA-ST.
The rapid development of sensor and multimodal technology has provided more possibilities for multisource remote sensing image classification. However, some existing joint classification methods are limited to single-level feature fusion and fail to fully explore the deep correlation between cross-level features, thus limiting the effective interaction and complementarity of information between different modal data. To alleviate this issue, this article proposes a hierarchical multimodal feature aggregation-based multihead axial attention transformer (HMAT) for joint classification of hyperspectral and light detection and ranging (LiDAR) data. First, a hierarchical multimodal feature aggregation module (HMFA) is proposed to more effectively fuse spatial–spectral features of hyperspectral images (HSIs) and elevation features of LiDAR data and generate more discriminative low-dimensional feature representations. Second, a pyramid-inverted pyramid convolution module (PIP) is designed. Through the complementary feature extraction structure, PIP can more fully capture the multiscale local features in the fused feature map of hyperspectral and LiDAR data. Finally, a multihead axial attention (MHAA) component is constructed to capture information at different scales in the fused feature maps, thereby accurately modeling global dependencies. The proposed HMAT has been extensively tested on three publicly available datasets. The experimental results demonstrate that the classification performance of the proposed method outperforms that of several state-of-the-art methods.
This research proposes an explainable AI-driven framework for optimizing 2D character merchandise marketing content, addressing the critical gap between conventional heuristic-driven strategies and data-driven decision-making. The proposed system integrates causal feature attribution and attention-guided generation to systematically model the relationship between content attributes and user engagement dynamics. At its core, a feature attribution engine quantifies the impact of visual and textual elements using Shapley values, while a vision-language transformer prioritizes high-attention regions during content creation. Furthermore, a Bayesian optimization loop iteratively refines marketing strategies based on real-time feedback, dynamically adjusting design parameters and posting schedules. The framework uniquely bridges interpretable AI with creative workflows, enabling marketers to make quantifiable adjustments rather than relying on intuition. Our implementation leverages state-of-the-art multimodal transformers and accelerated Shapley value approximations, ensuring scalability without sacrificing interpretability. Experimental results demonstrate that the system outperforms traditional methods in engagement metrics, particularly in click-through rates and user retention. The novelty lies in its closed-loop feedback mechanism, where explainable insights directly parametrize content generation tools, fostering a symbiotic relationship between machine intelligence and human creativity. This work contributes to both the AI and marketing communities by providing a transparent, adaptive solution for content optimization in the rapidly growing 2D character merchandise industry.
Functional magnetic resonance imaging (fMRI) is a powerful tool for probing brain function, yet reliable clinical diagnosis is hampered by low signal-to-noise ratios, inter-subject variability, and the limited frequency awareness of prevailing CNN- and Transformer-based models. Moreover, most fMRI datasets lack textual annotations that could contextualize regional activation and connectivity patterns. We introduce RTGMFF, a framework that unifies automatic ROI-level text generation with multimodal feature fusion for brain-disorder diagnosis. RTGMFF consists of three components: (i) ROI-driven fMRI text generation, which deterministically condenses each subject's activation, connectivity, age, and sex into reproducible text tokens; (ii) a hybrid frequency-spatial encoder, which fuses a hierarchical wavelet-Mamba branch with a cross-scale Transformer encoder to capture frequency-domain structure alongside long-range spatial dependencies; and (iii) an adaptive semantic alignment module, which embeds the ROI token sequence and visual features in a shared space, using a regularized cosine-similarity loss to narrow the modality gap. Extensive experiments on the ADHD-200 and ABIDE benchmarks show that RTGMFF surpasses current methods in diagnostic accuracy, achieving notable gains in sensitivity, specificity, and area under the ROC curve. Code is available at https://github.com/BeistMedAI/RTGMFF.
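The adaptive semantic alignment module is described as using a regularized cosine-similarity loss between ROI text tokens and visual features in a shared space; a hedged sketch is below, with the regularizer chosen for illustration only.

```python
import torch
import torch.nn.functional as F

def cosine_alignment_loss(text_emb, vis_emb, reg_weight=0.01):
    """Sketch of a regularized cosine-similarity alignment loss between paired
    ROI-text and visual embeddings in a shared space (shapes and the exact
    regularizer are assumptions).
    """
    t = F.normalize(text_emb, dim=-1)     # (B, D)
    v = F.normalize(vis_emb, dim=-1)      # (B, D)

    # Maximize cosine similarity of matched pairs (minimize 1 - cos)
    align = (1.0 - (t * v).sum(dim=-1)).mean()

    # Simple regularizer discouraging embedding-norm collapse before normalization
    reg = (text_emb.norm(dim=-1) - 1).pow(2).mean() + (vis_emb.norm(dim=-1) - 1).pow(2).mean()

    return align + reg_weight * reg
```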
The basic task of vehicle density estimation is to use image information to estimate the distribution and quantity of vehicles within it. However, many previous methods only use the optical information in red-green-blue (RGB) images, which makes it difficult to effectively identify potential vehicles under poor light, strong reflections, and bad weather, resulting in unsatisfactory density estimation performance. To address these problems, we consider introducing thermal images to provide a richer source of information for the vehicle density estimation task, and propose a multimodal feature fusion network (MFCNet) for accurate RGB-Thermal (RGB-T) vehicle density estimation. First, multimodal features are cross-integrated through the attention-guided multiscale feature fusion coordination module (MFFC) to compensate for the limitations of single modal features. Following this, the edge feature calibration module (EFC) is utilized to correct the spatial misalignment regions between modalities. Subsequently, the adaptive deep fusion module (ADFM) is applied to further refine the features on the global scale and improve the intermodality correlation. Finally, the features of different stages are fused step by step to obtain the final fused feature, which is fed into a simple regression header to generate a pixel-level vehicle density map. Experimental results show that the GAME2 and root mean square error of the proposed method are reduced to 5.21 and 3.54 on the DroneVehicle dataset, respectively. Compared with existing vehicle density estimation methods, MFCNet achieves competitive accuracy and can be applied to the vehicle density estimation task in unconstrained scenarios. Our codes will be available at https://github.com/QLingX/MFCNet.
In the field of synthetic aperture radar automatic target recognition (SAR ATR), inherent distributional discrepancies between electromagnetic synthetic and measured SAR images pose significant challenges to the potential applications of the former. To bridge the gap, a novel unsupervised domain adaptation framework based on multimodal feature fusion and global-local joint alignment (MFJA) is proposed in this article. The multimodal feature fusion focuses on describing each target more comprehensively by leveraging both visual and scattering topological information. In the visual branch, the full-aperture image is decomposed into multiple subaperture images to explore the scattering variations in the target at different azimuths, facilitating a richer visual description. Meanwhile, both local scattering and spatial position information of keypoints are simultaneously integrated into the feature extraction in the scattering topological branch, promoting a more comprehensive scattering topological representation. Subsequently, a gated feature fusion module (GFFM) is developed to effectively fuse features derived from different modalities. The global-local joint alignment aims to align different domains with greater precision. Specifically, a power normalized weighted gradient reversal layer (PN-WGRL) is proposed to guide the network to focus more on hard-to-align samples during global domain alignment, thus mitigating their interference with local domain alignment. While MFJA achieves satisfactory cross-domain recognition performance, its inference efficiency is somewhat constrained. Therefore, a domain-invariant cross-modal knowledge distillation (DCKD) algorithm with a tri-path collaborative alignment strategy is further developed to distill discriminative and domain-invariant knowledge from the multimodal model into a compact visual model based on full-aperture images, thereby accelerating inference. Experiments conducted in three scenarios on the public synthetic and measured paired labeled experiment (SAMPLE) dataset validate the effectiveness of both MFJA and DCKD.
The joint classification of hyperspectral image (HSI) and light detection and ranging (LiDAR) data seeks to provide a more comprehensive characterization of target objects. Multimodal data possess distinct semantic structures in both spectral and spatial dimensions, making efficient feature complementarity and redundancy elimination crucial. To this end, we propose a self-distillation-based multimodal feature alignment network (DFANet), which employs two branches to capture spectral and spatial similarities, respectively, and integrates structural discriminative information from LiDAR at two stages for more effective multimodal data integration. The network comprises three main components: a feature alignment fusion module (FAFM), an offset attention module (OAM), and a self-distillation mechanism. Specifically, the FAFM guides feature alignment through channel-assimilative mapping of multimodal data. The OAM addresses boundary patch classification challenges by learning offset weights of reference points. The self-distillation mechanism filters out irrelevant information during feature alignment by enhancing the coordination between high-level and low-level features. Adequate experiments indicate that our method achieves better results compared to the most recent hyperspectral classification methods on three public datasets.
Forward-looking sonar (FLS) image segmentation can help reduce the amount of raw data that needs to be transmitted in underwater communication systems, making it a crucial technique for next-generation communication systems and the Internet of Things (IoT). However, its effectiveness is often hindered by weak semantic information, blurry edges and low resolution, which pose challenges for current segmentation algorithms. In this study, we propose a multimodal feature-enhanced Unet for FLS image segmentation (MFEUnet), built upon the Unet framework. The multimodal features considered primarily include spatial and frequency features. For spatial features, recognizing Unet’s strength in local feature extraction, we integrate a transformer to enhance its ability to capture global features. Frequency features are utilized to capture different details of FLS images, with a dual-branch wavelet transformation employed to decompose images into low-frequency and high-frequency components, facilitating the enhancement of these features. And a preprocessing reconstruction module is integrated to reduce the noise of FLS images. Furthermore, to address class imbalance in FLS datasets, we design a specialized segmentation loss function. Experimental results show that MFEUnet significantly outperforms state-of-the-art segmentation methods, demonstrating its effectiveness in overcoming the unique challenges of underwater sonar imaging.
The traditional online sports education action-normality matching algorithm relies only on image data and is susceptible to environmental interference, resulting in misjudgments and omissions. To this end, an algorithm based on multimodal feature fusion is proposed. The algorithm first performs multimodal feature extraction on sports movements, integrating multiple data sources such as image data and sensor data. The image data include spatial features such as joint position, body contour, posture, limb angle, and motion trajectory, obtained through the color and depth infrared cameras of Kinect devices; the sensor data include temporal features such as acceleration, angular velocity, force, and motion rhythm, collected by the built-in sensors of the Kinect device. A weighted fusion method is used to fuse the features and assign weights, and the normality of the actions is then evaluated through keyframe matching. In the experimental stage, two mainstream indoor action recognition datasets, NTU RGB+D 60 and NTU RGB+D 120, were selected, from which three sets of actions with similar joint coordinates were chosen to construct a similar-action dataset for verification. The experiments show that the algorithm can more accurately identify and match joint standard actions without misjudgment or omission. The Kappa coefficient of all tested actions exceeds 0.8, and the method has significant advantages in matching accuracy, applicability, and generalization ability.
Different modalities of neuroimaging data can provide complementary lesion information for the diagnosis of Alzheimer's disease. However, existing methods face trade-offs in obtaining three-dimensional lesion features and suppressing redundant 3D features. It is difficult to simultaneously preserve the ability of 2D networks to reduce redundant features and lower training costs while extracting features of the 3D lesion regions with spatial structure. To address this challenge, this paper proposes a multi-dimensional integrated multimodal feature fusion AD prediction network (MDMF-Net) to obtain low-redundancy lesion information with 3D spatial structural features. First, a multi-dimensional joint feature extraction module is proposed, where a 2D network is used to acquire 2D lesion features and generate a saliency feature map. Based on the saliency feature map, key lesion blocks are segmented from the 3D brain images to extract 3D lesion features. Secondly, a multimodal multi-view perception feature fusion module is designed to reduce heterogeneity of features from different modalities and dimensions through various attention mechanisms and a dual-stage fusion strategy. Finally, a saliency feature loss is introduced to enhance the saliency response of the key lesion regions while suppressing the response from irrelevant regions. Experimental results on the ADNI database show that our model achieves a classification accuracy of 89.12%, outperforming several competitive methods in terms of prediction performance.
Existing fall detection methods face three critical challenges in complex dynamic environments: high false alarm rates, insufficient modeling of long-duration action sequences, and privacy risks from data leakage. Traditional unimodal models struggle to capture the multi-stage causal evolution of fall motions comprehensively. To address these limitations, we propose a Spatio-Temporal Collaborative Attention Network (STCANet) that integrates bidirectional spatio-temporal attention mechanisms with optimized multimodal feature fusion, significantly enhancing detection accuracy and efficiency. The architecture employs a dual-path Transformer framework, jointly modeling spatial joint correlations and temporal causal chains through space-to-time and time-to-space pathways. Additionally, a fusion framework combining kinematic features (centroid velocity/joint angular velocity) with geometric features (silhouette deformation/aspect ratio) is designed to strengthen the model's discriminative power for fall recognition. Furthermore, a lightweight skeleton-based data anonymization model is developed to ensure privacy security while achieving synergistic optimization of both privacy and computational efficiency. Experimental results on the Human3.6M dataset demonstrate a detection precision of 94.2% and recall rate of 93.8%, with false alarms reduced by 74.7% compared to state-of-the-art methods. The model requires only 1.2M parameters and achieves real-time inference at 167 FPS.
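The kinematic and geometric descriptors mentioned above (centroid velocity, silhouette deformation, aspect ratio) can be computed directly from 2D skeleton sequences; the sketch below uses joint bounding boxes as a silhouette proxy, which is an assumption rather than the paper's feature definition.

```python
import numpy as np

def fall_features(joints, fps=30.0):
    """Hedged sketch of kinematic/geometric fall descriptors.

    joints: (T, J, 2) pixel coordinates of J skeleton joints over T frames.
    Returns per-frame centroid speed, bounding-box aspect ratio, and its change;
    definitions and the bounding-box silhouette proxy are illustrative assumptions.
    """
    centroid = joints.mean(axis=1)                                           # (T, 2)
    centroid_speed = np.linalg.norm(np.diff(centroid, axis=0), axis=1) * fps # (T-1,)

    # Silhouette proxy: bounding box of the joints per frame
    mins, maxs = joints.min(axis=1), joints.max(axis=1)                      # (T, 2) each
    width = maxs[:, 0] - mins[:, 0]
    height = maxs[:, 1] - mins[:, 1]
    aspect_ratio = width / np.maximum(height, 1e-6)   # grows as the body tilts toward horizontal

    return {
        "centroid_speed": centroid_speed,
        "aspect_ratio": aspect_ratio,
        "aspect_ratio_change": np.diff(aspect_ratio) * fps,
    }
```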
The aggregation of multimodal features in medical image registration remains underexplored, limiting the performance of current models in capturing complex anatomical relationships. Traditional convolutional neural networks (CNNs) often overlook the rich semantic information available from text, while existing approaches lack effective methods to combine spatial and contextual cues. In this paper, we propose Text Aggregation for Medical Image Registration (TA-MIR), a novel framework that enhances the encoder-decoder architecture by incorporating anatomical text embeddings throughout the registration process. By employing large kernel blocks for improved receptive fields in U-Net and fusion blocks at each level, our model effectively integrates image features with semantic text information. Extensive experiments on three brain MRI datasets (OASIS, IXI, and LPBA40) demonstrate that our approach achieves state-of-the-art performance, significantly improving registration accuracy and anatomical coherence compared to traditional CNN- and Transformer-based methods.
Children with Autism Spectrum Disorder (ASD) face significant difficulties in emotional expression and recognition, and traditional manual observation methods struggle to capture their weak and transient micro-expression features. To address this issue, this paper proposes an autism emotion recognition model that integrates Vision Transformer with multimodal features. The model first employs the TVL1 optical flow algorithm to extract facial motion features and utilizes Vision Transformer to model long-range dependencies between different facial regions. Subsequently, it introduces a feature selection fusion module (FSFM) to filter key image patches, a cross-attention fusion module (CAFM) to integrate horizontal and vertical optical flow information, and designs a spatial consistency attention module (SCAM) to ensure feature distribution consistency. Finally, it incorporates Maximally Collapsing Metric Learning (MCML) to optimize the feature space structure. On standard micro-expression databases including MMEW, CASMEII, and SAMM, this method achieves recognition accuracies significantly superior to existing approaches (73.0%, 76.4%, and 70.5%, respectively). Furthermore, the proposed method demonstrates good generalization capability and real-time performance, showing promise as an intelligent assistive tool for special education to enhance teachers’ understanding of emotional states in children with autism and improve intervention efficiency, thereby promoting personalized education development.
Diabetic Retinopathy (DR) remains a leading preventable cause of visual disability globally, necessitating strong automated screening methods. In this paper, we present a novel dual-network architecture that explicitly models spatial relations in retinal pathology, a primary limitation of traditional deep learning approaches to DR grading. Our objective is to enhance classification performance via strategic fusion of convolutional feature learning and graph-based spatial reasoning. Our approach transforms CNN-extracted features into structured graph representations that enable high-level modeling of interregional relations using attention mechanisms. Comprehensive testing on the APTOS2019 dataset attains significant improvements in multi-grade DR classification accuracy and Quadratic Kappa Score over state-of-the-art baseline approaches. Our results quantitatively confirm that explicit spatial feature relationship modeling significantly enhances automated DR severity grading. This paper advances the boundaries of automated retinal disease grading by presenting an effective architectural model that integrates convolutional and graph neural networks to learn local pathological features and their global contextual correlations.
Aiming at the problems of spatiotemporal alignment, insufficient modal interaction, and weak modeling of long-term dependencies in abnormal behavior detection based on video and location, a multi-modal feature fusion method for abnormal behavior detection based on attention mechanisms is proposed. The proposed method consists of three modules: feature extraction, multi-modal feature fusion, and anomaly detection. First, in the feature extraction module, we utilize the ViViT (Video Vision Transformer) model to extract video features, capturing spatiotemporal action continuity through 3D block embedding; we utilize ST-GCN (Spatial Temporal Graph Convolutional Networks) to extract localization features, constructing the spatiotemporal graph with the locations of target personnel as nodes and spatiotemporal movement correlation as edges. Then, in the multi-modal feature fusion module, we utilize the cross-attention mechanism to fuse video and localization features to improve the expressive ability of cross-modal data features, and we reduce the number of network parameters through a parameter-sharing strategy. Additionally, we apply a spatiotemporal separation attention mechanism to jointly model the spatiotemporal correlation between video action features and the trajectories of target personnel. At the same time, a dynamic gating fusion strategy is introduced to dynamically adjust weights based on the quality of video and localization data during training. Finally, in the anomaly detection module, a multi-granularity decoder is used to generate frame-level video anomaly probabilities and trajectory segment anomaly scores in parallel, with a spatiotemporal consistency loss function constraining their alignment to ensure the spatiotemporal matching of behavior and trajectory anomalies. This framework offers an efficient and robust multi-modal analysis approach for security monitoring systems.
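A minimal sketch of bidirectional cross-attention fusion between video tokens and localization (trajectory) tokens, with a single shared attention module standing in for the parameter-sharing strategy mentioned above; dimensions and the residual/normalization layout are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Minimal sketch of cross-attention fusion between video and localization
    token sequences; dimensions and the shared-projection choice are assumptions
    for illustration, not the paper's exact architecture.
    """
    def __init__(self, dim=256, heads=4):
        super().__init__()
        # One attention module reused in both directions to limit parameter count
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, video_tokens, traj_tokens):
        # video_tokens: (B, Nv, dim) from the video backbone (e.g., ViViT)
        # traj_tokens:  (B, Nt, dim) from the localization branch (e.g., ST-GCN)
        v2t, _ = self.attn(query=video_tokens, key=traj_tokens, value=traj_tokens)
        t2v, _ = self.attn(query=traj_tokens, key=video_tokens, value=video_tokens)
        video_fused = self.norm_v(video_tokens + v2t)   # residual + norm per modality
        traj_fused = self.norm_t(traj_tokens + t2v)
        return video_fused, traj_fused
```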
The clinical diagnosis of depression depends on subjective scales, which lack objectivity. Thus, precise auxiliary diagnostic tools are urgently needed. Although electroencephalography (EEG) can offer an objective foundation for auxiliary depression diagnosis, its complex spatiotemporal and spectral features present substantial challenges for feature extraction. To overcome the limitations of existing methods in multidimensional information fusion and key feature selection, we propose a dual-stream spatiotemporal and frequency-spatial attention U-Net model, termed DSMFF-UNet. The model processes spatiotemporal and frequency-spatial 3D EEG representations in parallel. It adaptively selects and weights the most discriminative deep features via attention modules integrated into the U-Net skip connections. On the public MODMA dataset, DSMFF-UNet achieved 96.33% accuracy and an F1 score >0.96, substantially outperforming existing baselines. These findings indicate that the deep integration and adaptive emphasis on multidimensional EEG features offer an effective approach for high-accuracy automated depression detection. This lays the groundwork for objective clinical diagnostic aids.
Brain–computer interfaces (BCIs) are an important mode of human-computer interaction with the ability to monitor brain states, and they have become an increasingly significant research direction. Single-modal noninvasive brain signals have limitations, such as low spatial resolution or low temporal resolution, while multimodal brain signal acquisition and processing can overcome them. Combined electroencephalography and functional near-infrared spectroscopy (EEG-fNIRS) is advantageous for multimodal brain signal processing, but current fusion methods mostly rely on manual feature extraction or channel selection, which may discard important information in real-time BCI systems. To solve this issue, this article proposes an innovative fusion analysis method for EEG-fNIRS multimodal brain signals, using a hybrid algorithm that combines a convolutional neural network (CNN) and attention mechanisms for signal classification. The method first preprocesses the EEG and fNIRS signals separately, then extracts features using spatial-temporal convolutional layers, and finally merges them through dual attention computation to obtain the classification results. Our method is validated on two publicly available mixed EEG-fNIRS BCI datasets, covering three experimental tasks that do not involve actual movement: motor imagery (MI), mental arithmetic (MA), and word generation (WG). The accuracy reached 92.2% for MI, 98.6% for MA, and 95.2% for WG, surpassing all current methods. This indicates that the proposed method achieves better classification performance on non-movement classification tasks while remaining lightweight, and it can be applied to rapid and efficient identification of brain signals.
Recent advancements in emotion recognition research based on physiological data have been notable. However, existing multimodal methods often overlook the interrelations between various modalities, such as video and electroencephalography (EEG) data, in emotion recognition. In this article, a feature fusion-based hierarchical cross-modal spatial fusion network (HCSFNet) is proposed that effectively integrates EEG and video features. By designing an EEG feature extraction network based on 1-D convolution and a video feature extraction network based on 3-D convolution, corresponding modality features are thoroughly extracted. To promote sufficient interaction between the two modalities, a hierarchical cross-modal coordinated attention module is proposed in this article. Additionally, to enhance the network's perceptual ability for emotion-related features, a multiscale spatial pyramid pooling module is also designed. Meanwhile, a self-distillation method is introduced, which enhances the performance while reducing the number of parameters in the network. The HCSFNet achieved an accuracy of 97.78% on the valence–arousal dimension of the Database for Emotion Analysis using Physiological Signals (DEAP) dataset, and it also obtained an accuracy of 60.59% on the MAHNOB-human-computer interaction (HCI) dataset, reaching the state-of-the-art level.
Through Internet of Things (IoT) communication technology, collaborative perception enhances a vehicle’s capacity to discern its surroundings while driving by integrating and synchronizing sensor data from multiple agents. With the advancement of cooperative perception techniques in single-modality methods, there has been a growing trend toward integrating multimodal data from heterogeneous sensors in recent years. However, due to the data heterogeneity inherent in diverse sensors, Bird’s Eye View (BEV) maps generated from different types of sensors may exhibit local discrepancies in the spatial representation of entity positions. Furthermore, individual agents may produce uncertain and flawed feature representations in real noisy environments. The influence of this indeterminacy exacerbates the issue of local inconsistency, leading to misalignment of the detected target during BEV alignment and fusion, thereby reducing detection accuracy. To address these problems, we propose a modal decision-making spatial alignment cooperative perception network (MDNet). First, the network generates BEV feature maps through dense depth image supervision for voxel feature extraction and model-guided selective feature fusion. Subsequently, we achieve enhanced accuracy in object detection by performing spatial alignment of BEV representations generated from two distinct sensors, both globally and locally within the spatial domain. Besides, we employ a cascaded centralized pyramid strategy during the message fusion stage, facilitating flexible sampling across horizontal and vertical spatial dimensions, promoting deep interaction among multiple agents. We conduct quantitative and qualitative experiments on the public OPV2V and DAIR-V2X-C benchmarks, and our proposed MDNet exhibits superior performance and stronger robustness in the 3-D object detection task, providing more precise target detection results.
Although remote sensing (RS) data with multiple modalities can significantly improve the accuracy of semantic segmentation (SS), how to effectively extract multimodal information through multimodal feature fusion remains a challenging task. Specifically, existing methods for multimodal feature fusion still face two major challenges: 1) due to the diverse imaging mechanisms of multimodal RS data, the boundaries of the same foreground may vary across modalities, leading to the inclusion of unwanted background semantics in the fused foreground features, and 2) RS data from different modalities exhibit varying discriminative abilities for different foregrounds, making it challenging to determine the proportion of semantic information contributed by each modality in the fusion results. To address these issues, we propose a dynamic feature fusion method based on region-wise queries, namely DF2RQ, for SS of multimodal RS data. The method is primarily composed of two components: the spatial reconstruction (SR) module and the dynamic fusion (DF) module. Within the SR module, we propose an SR scheme that samples foreground features from different modalities, achieving independent reconstruction of the unimodal features and thereby alleviating the semantic mixing between foreground and background across modalities. In the DF module, a feature fusion scheme based on unimodal feature reference positions is proposed to obtain fusion weights for each modality, enabling dynamic fusion of complementary features from multiple modalities. The performance of the proposed method has been extensively evaluated on various multimodal RS datasets for SS, and the experimental results consistently show that it achieves state-of-the-art (SOTA) accuracy on multiple commonly used metrics. Our code is available at https://github.com/I3ab/DF2RQ.
Accurate detection and precise localization of anomalies during precision component manufacturing are essential to maintaining high product quality. Multimodal industrial anomaly detection (MIAD) harnesses data from diverse sensors to effectively identify and pinpoint defects in industrial products. Recent MIAD approaches have made significant progress but often ignore the global contextual semantics of point cloud data and modality-specific information, resulting in incomplete point cloud representations and inadequate multimodal fusion. To confront these issues, we propose a robust feature representation and comprehensive multimodal feature fusion network, the views-graph and latent feature disentangled fusion network (VLDFNet), for anomaly detection in high-precision industrial components. VLDFNet mainly consists of a point cloud views-graph representation model and a multimodal disentangled feature latent space fusion module. Specifically, the point cloud views-graph representation model explores spatial locations and semantic relationships between views using multilevel graph fusion. The multimodal disentangled feature latent space fusion module disentangles multimodal features into shared and specific representations to mitigate the omission of modality-specific information. VLDFNet also introduces a cross-modal shared feature interaction (CSFI) strategy to extract coherent semantic information by aligning and integrating cross-modal features. Comprehensive experimental results on multiple datasets demonstrate that our method significantly outperforms existing approaches in detection accuracy.
Urban analytics increasingly relies on AI-driven trajectory analysis, yet current approaches suffer from methodological fragmentation: trajectory learning captures movement patterns but ignores spatial context, while spatial embedding methods encode street networks but miss temporal dynamics. Three gaps persist: (1) lack of joint training that integrates spatial and temporal representations, (2) origin-agnostic treatment that ignores directional asymmetries in navigation ($A \to B \ne B \to A$), and (3) over-reliance on auxiliary data (POIs, imagery) rather than fundamental geometric properties of urban space. We introduce a conditional trajectory encoder that jointly learns spatial and movement representations while preserving origin-dependent asymmetries using geometric features. This framework decomposes urban navigation into shared cognitive patterns and origin-specific spatial narratives, enabling quantitative measurement of cognitive asymmetries across starting locations. Our bidirectional LSTM processes visibility ratio and curvature features conditioned on learnable origin embeddings, decomposing representations into shared urban patterns and origin-specific signatures through contrastive learning. Results from six synthetic cities and real-world validation on Beijing's Xicheng District demonstrate that urban morphology creates systematic cognitive inequalities. This provides urban planners with quantitative tools for assessing experiential equity, offers architects insights into the cognitive impacts of layout decisions, and enables origin-aware analytics for navigation systems.
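A minimal sketch of an origin-conditioned bidirectional LSTM encoder over per-step geometric features, split into shared and origin-specific parts; the dimensions, pooling, and the concatenation of the origin embedding are illustrative assumptions rather than the authors' exact design.

```python
# Origin-conditioned trajectory encoder: a BiLSTM over per-step geometric
# features (e.g., visibility ratio, curvature) conditioned on a learnable
# origin embedding, producing a shared and an origin-specific code.
import torch
import torch.nn as nn

class OriginConditionedEncoder(nn.Module):
    def __init__(self, n_origins, feat_dim=2, origin_dim=16, hidden=64):
        super().__init__()
        self.origin_emb = nn.Embedding(n_origins, origin_dim)
        self.lstm = nn.LSTM(feat_dim + origin_dim, hidden,
                            batch_first=True, bidirectional=True)
        # split the pooled code into a shared part and an origin-specific part
        self.to_shared = nn.Linear(2 * hidden, 32)
        self.to_specific = nn.Linear(2 * hidden, 32)

    def forward(self, feats, origin_ids):
        # feats: (batch, steps, feat_dim); origin_ids: (batch,)
        origin = self.origin_emb(origin_ids)                    # (batch, origin_dim)
        origin = origin.unsqueeze(1).expand(-1, feats.size(1), -1)
        out, _ = self.lstm(torch.cat([feats, origin], dim=-1))
        pooled = out.mean(dim=1)                                # (batch, 2*hidden)
        return self.to_shared(pooled), self.to_specific(pooled)

enc = OriginConditionedEncoder(n_origins=6)
shared, specific = enc(torch.randn(4, 50, 2), torch.tensor([0, 1, 2, 3]))
```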
With the growing interest in 3D Gaussian Splatting (3DGS) for scene analysis, current approaches typically extract 2D features from multi-view images via pre-trained models before embedding them into 3DGS representations. Such multi-stage pipelines not only incur computational overhead but also fail to holistically leverage the geometric and appearance properties of 3DGS. We introduce GARNET, the first end-to-end feature extraction framework for pre-optimized 3DGS. It directly renders Gaussian primitives into feature maps, enabling joint learning of 3D structure and 2D texture. The framework comprises a Gaussian Propagator for information exchange among neighboring primitives and a Feature Renderer that generates viewpoint-specific feature maps to direct model attention toward discriminative regions. Using 3D object classification as a case study, GARNET achieves 93.83% accuracy on the texture-rich MACGS benchmark, surpassing the previous state-of-the-art by +1.92%, while also excelling on the geometry-focused ModelNet40GS. These gains come with minimal increases in parameters and inference time, making GARNET an efficient unified solution for 3DGS-based recognition tasks. Our code is available at https://github.com/zlfffan/Code/tree/Garnet3DGS.
Standard LLM evaluation practices compress diverse abilities into single scores, obscuring their inherently multidimensional nature. We present JE-IRT, a geometric item-response framework that embeds both LLMs and questions in a shared space. For question embeddings, the direction encodes semantics and the norm encodes difficulty, while correctness on each question is determined by the geometric interaction between the model and question embeddings. This geometry replaces a global ranking of LLMs with topical specialization and enables smooth variation across related questions. Building on this framework, our experimental results reveal that out-of-distribution behavior can be explained through directional alignment, and that larger norms consistently indicate harder questions. Moreover, JE-IRT naturally supports generalization: once the space is learned, new LLMs are added by fitting a single embedding. The learned space further reveals an LLM-internal taxonomy that only partially aligns with human-defined subject categories. JE-IRT thus establishes a unified and interpretable geometric lens that connects LLM abilities with the structure of questions, offering a distinctive perspective on model evaluation and generalization.
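As a hedged illustration of how direction and norm can jointly determine correctness in such a geometric item-response model (this is not the paper's exact parameterization, only a simple form consistent with "direction encodes semantics, norm encodes difficulty"), with model embedding $m$ and question embedding $q$ one could write

$P(\text{correct} \mid m, q) = \sigma\big(\langle m,\, q/\lVert q\rVert \rangle - \lVert q\rVert\big),$

where stronger alignment of the model with the question's semantic direction raises the success probability and a larger question norm (greater difficulty) lowers it.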
Attributed multiplex networks are powerful representations of complex systems where nodes represent entities, their attributes represent the properties, and each type of interaction is modeled as a relationship (layer) in a network. To analyze these networks, it is crucial to find a meaningful representation of nodes, node attributes, and class labels into a joint low-dimensional space. To this end, we propose a Contrastive Joint Embedding approach for Multiple Networks, CJEMN, that employs negative sampling and pseudo-labeling to obtain a meaningful embedding of all information within an attributed multiplex network. To the best of our knowledge, this is the first approach that utilizes negative sampling and pseudo-labeling to jointly embed nodes, node attributes, and class labels of attributed multiplex networks in a low-dimensional space. In addition to using spectral embedding and homogeneity analysis, our method incorporates negative pairs as a new layer to enhance the representation of similarities and dissimilarities among nodes, attributes, and class labels. We run experiments on five real-world datasets to evaluate the performance of CJEMN. Our approach outperforms state-of-the-art methods for downstream tasks, such as node classification and clustering.
Knowledge Graphs (KGs), with their intricate hierarchies and semantic relationships, present unique challenges for graph representation learning, necessitating tailored approaches to effectively capture and encode their complex structures into useful numerical representations. The fractal-like nature of these graphs, where patterns repeat at various scales and complexities, requires specialized algorithms that can adapt to and learn from the multi-level structures inherent in the data. This similarity to fractals requires methods that preserve the recursive detail of knowledge graphs while facilitating efficient learning and extraction of relational patterns. In this study, we explore the integration of similarity groups with attention mechanisms to represent knowledge graphs in complex spaces. In our approach, SimE, we make use of the algebraic (bijection) and geometric (similarity) properties of similarity transformations to enhance the representation of self-similar fractals in KGs. We empirically validate SimE's ability to represent bijections and similarities on benchmark KGs. We also conducted controlled experiments that capture one-to-one, one-to-many, and many-to-many relational patterns and studied the behavior of state-of-the-art models, including the proposed SimE model. Because of the lack of benchmark fractal-like KG datasets, we created a set of fractal-like testbeds to assess the subgraph similarity learning ability of models. The observed results suggest that SimE captures the complex geometric structures of KGs whose statements satisfy these algebraic and geometric properties. In particular, SimE is competitive with state-of-the-art KG embedding models and achieves high Hits@1 values; as a result, it effectively predicts correct links and ranks them highly. SimE is publicly available on GitHub: https://github.com/NIMI-research/SimE.
While raw cosine similarity in pretrained embedding spaces exhibits strong rank correlation with human judgments, anisotropy induces systematic miscalibration of absolute values: scores concentrate in a narrow high-similarity band regardless of actual semantic relatedness, limiting interpretability as a quantitative measure. Prior work addresses this by modifying the embedding space (whitening, contrastive fine-tuning), but such transformations alter geometric structure and require recomputing all embeddings. Using isotonic regression trained on human similarity judgments, we construct a monotonic transformation that achieves near-perfect calibration while preserving rank correlation and local stability (98% across seven perturbation types). Our contribution is not to replace cosine similarity, but to restore interpretability of its absolute values through monotone calibration, without altering its ranking properties. We characterize isotonic calibration as an order-preserving reparameterization and prove that all order-based constructions (angular ordering, nearest neighbors, threshold graphs, and quantile-based decisions) are invariant under this transformation.
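Because the calibration step is a standard isotonic fit, a minimal sketch with scikit-learn's IsotonicRegression illustrates the idea; the data below is synthetic, whereas in the paper's setting the targets are human similarity judgments.

```python
# Calibrating raw cosine similarities with a monotone (order-preserving) map
# learned by isotonic regression: rankings are unchanged, but absolute values
# become interpretable. Synthetic data stands in for human judgments.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
raw_cosine = rng.uniform(0.55, 0.95, size=200)   # anisotropy: narrow high band
human_scores = np.clip((raw_cosine - 0.55) / 0.40 + rng.normal(0, 0.05, 200), 0, 1)

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(raw_cosine, human_scores)

# calibrated values now span an interpretable 0-1 range, same ordering as input
print(calibrator.predict(np.array([0.60, 0.75, 0.90])))
```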
Hyperspectral unmixing techniques face challenges of insufficient endmember separability and degradation of abundance spatial continuity in complex scenarios. Traditional methods, which often overlook the geometric structure information of the data and feature space, exhibit significant performance limitations in the presence of noise and microscopic mixing. To address these issues, this paper proposes a dual-graph manifold-regularized hyperspectral unmixing framework, which, for the first time, jointly embeds a data graph (modeling the similarity of pixel spatial distribution) and a feature graph (constraining the endmember spectral manifold structure) into the nonnegative matrix factorization (NMF) model. By jointly preserving the spatial-spectral geometric properties, the proposed approach achieves precise decoupling of endmembers and abundances. This method innovatively designs dual-graph Laplacian regularization terms, which simultaneously enhance the spatial smoothness of abundances and the spectral discriminability of endmembers within a unified optimization objective. An adaptive alternating optimization algorithm is developed to solve the resulting nonconvex problem. Experiments on both synthetic and real hyperspectral data demonstrate that the proposed method significantly outperforms state-of-the-art algorithms, providing a robust and physically interpretable unmixing paradigm for complex mixing scenarios.
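For reference, the generic dual-graph-regularized NMF objective that this kind of framework builds on (the paper's exact terms and weights may differ) factorizes the hyperspectral data $X$ into endmembers $W$ and abundances $H$ while penalizing roughness on a data graph with Laplacian $L_d$ and a feature graph with Laplacian $L_f$:

$\min_{W \ge 0,\, H \ge 0}\ \lVert X - WH \rVert_F^2 + \lambda_1\,\mathrm{tr}\!\left(H L_d H^{\top}\right) + \lambda_2\,\mathrm{tr}\!\left(W^{\top} L_f W\right),$

where the first regularizer encourages spatially smooth abundances over similar pixels and the second constrains the endmember spectra to the feature manifold.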
Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while preserving maximum-entropy up to rescaling under expected $\ell_p$ norm constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and competitive downstream performance on image classification benchmarks, demonstrating that RDMReg effectively enforces sparsity while preserving task-relevant information.
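As a rough illustration of a sliced two-sample distribution-matching penalty, the sketch below projects representations and samples from a target distribution onto random directions and compares the sorted projections; a plain rectified Gaussian stands in for the Rectified Generalized Gaussian target, so this is only a generic sliced-matching loss, not RDMReg itself.

```python
# Sliced two-sample distribution matching: compare sorted 1-D projections of
# the representation batch and of samples drawn from a target distribution
# (a sliced Wasserstein-style penalty). The rectified-Gaussian target is a
# stand-in, not the paper's RGG distribution.
import torch

def sliced_matching_loss(z, target_sampler, n_slices=64):
    batch, dim = z.shape
    target = target_sampler((batch, dim))
    dirs = torch.randn(dim, n_slices)
    dirs = dirs / dirs.norm(dim=0, keepdim=True)      # unit projection directions
    proj_z = (z @ dirs).sort(dim=0).values            # (batch, n_slices)
    proj_t = (target @ dirs).sort(dim=0).values
    return ((proj_z - proj_t) ** 2).mean()

rectified_gaussian = lambda shape: torch.relu(torch.randn(shape))
z = torch.randn(128, 32, requires_grad=True)          # representations to regularize
loss = sliced_matching_loss(z, rectified_gaussian)
loss.backward()
```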
The significance of Temporal Knowledge Graphs (TKGs) in Artificial Intelligence (AI) lies in their capacity to incorporate time-dimensional information, support complex reasoning and prediction, optimize decision-making processes, enhance the accuracy of recommendation systems, promote multimodal data integration, and strengthen knowledge management and updates. This provides a robust foundation for various AI applications. To effectively learn and apply both static and dynamic temporal patterns for reasoning, a range of embedding methods and large language models (LLMs) have been proposed in the literature. However, these methods often rely on a single underlying embedding space, whose geometric properties severely limit their ability to model intricate temporal patterns, such as hierarchical and ring structures. To address this limitation, this paper proposes embedding TKGs into projective geometric space and leveraging LLM technology to extract crucial temporal node information, thereby constructing the 5EL model. By embedding TKGs into projective geometric space and utilizing Möbius group transformations, we effectively model various temporal patterns. Subsequently, LLM technology is employed to process the trained TKGs. We adopt a parameter-efficient fine-tuning strategy to align the LLMs with specific task requirements, thereby enhancing the model's ability to recognize structural information of key nodes in historical chains and enriching the representation of central entities. Experimental results on five advanced TKG datasets demonstrate that our proposed 5EL model significantly outperforms existing models.
Vision-language models encode images and text in a joint space, minimizing the distance between corresponding image and text pairs. How are language and images organized in this joint space, and how do the models encode meaning and modality? To investigate this, we train and release sparse autoencoders (SAEs) on the embedding spaces of four vision-language models (CLIP, SigLIP, SigLIP2, and AIMv2). SAEs approximate model embeddings as sparse linear combinations of learned directions, or "concepts". We find that, compared to other methods of linear feature learning, SAEs are better at reconstructing the real embeddings while retaining the most sparsity. Retraining SAEs with different seeds or a different data diet leads to two findings: the rare, specific concepts captured by the SAEs are liable to change drastically, but commonly activating concepts are remarkably stable across runs. Interestingly, while most concepts activate primarily for one modality, we find they are not merely encoding modality per se. Many are almost orthogonal to the subspace that defines modality, and the concept directions do not function as good modality classifiers, suggesting that they encode cross-modal semantics. To quantify this bridging behavior, we introduce the Bridge Score, a metric that identifies concept pairs which are both co-activated across aligned image-text inputs and geometrically aligned in the shared space. This reveals that even single-modality concepts can collaborate to support cross-modal integration. We release interactive demos of the SAEs for all models, allowing researchers to explore the organization of the concept spaces. Overall, our findings uncover a sparse linear structure within VLM embedding spaces that is shaped by modality yet stitched together through latent bridges, offering new insight into how multimodal meaning is constructed.
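A minimal sketch of the kind of sparse autoencoder involved: an overcomplete dictionary trained on frozen embeddings with an L1 penalty, so each embedding is approximated by a sparse linear combination of learned "concept" directions. The sizes and penalty weight are illustrative assumptions, not the released SAEs' configuration.

```python
# Sparse autoencoder on frozen embeddings: ReLU codes over an overcomplete
# dictionary, reconstruction loss plus an L1 sparsity penalty.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, embed_dim=512, n_concepts=4096):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, n_concepts)
        self.decoder = nn.Linear(n_concepts, embed_dim, bias=False)

    def forward(self, x):
        codes = torch.relu(self.encoder(x))   # sparse, non-negative activations
        recon = self.decoder(codes)           # sparse combination of concept directions
        return recon, codes

sae = SparseAutoencoder()
emb = torch.randn(256, 512)                   # stand-in for frozen VLM embeddings
recon, codes = sae(emb)
loss = ((recon - emb) ** 2).mean() + 1e-3 * codes.abs().mean()
loss.backward()
```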
Generating Scalable Vector Graphics (SVG) from natural language descriptions poses significant challenges due to the need for precise semantic understanding, structural consistency, and strict syntactic adherence. Existing models often struggle to balance these aspects effectively. This paper proposes SVGGemma-Tuner, a fine-tuning framework that integrates structured instruction embedding to enhance geometric semantic comprehension, a dual-stage decoding architecture to separate layout planning from SVG token generation, and a syntax-aware reinforcement module to optimize syntactic validity through reinforcement learning. By jointly optimizing sequence prediction, spatial alignment, and syntax compliance, SVGGemma-Tuner demonstrates superior performance over existing approaches in generating coherent, semantically accurate, and syntactically valid SVG outputs.
The study of neural representations, both in biological and artificial systems, is increasingly revealing the importance of geometric and topological structures. Inspired by this, we introduce Event2Vec, a novel framework for learning representations of discrete event sequences. Our model leverages a simple, additive recurrent structure to learn composable, interpretable embeddings. We provide a theoretical analysis demonstrating that, under specific training objectives, our model's learned representations in a Euclidean space converge to an ideal additive structure. This ensures that the representation of a sequence is the vector sum of its constituent events, a property we term the linear additive hypothesis. To address the limitations of Euclidean geometry for hierarchical data, we also introduce a variant of our model in hyperbolic space, which is naturally suited to embedding tree-like structures with low distortion. We present experiments to validate our hypothesis. Quantitative evaluation on the Brown Corpus yields a Silhouette score of 0.0564, outperforming a Word2Vec baseline (0.0215), demonstrating the model's ability to capture structural dependencies without supervision.
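The linear additive hypothesis is easy to illustrate: if a sequence code is the sum of learned event vectors, then the code of a concatenation equals the sum of the codes of its parts. The PyTorch sketch below shows only this Euclidean, additive idea; all sizes are illustrative and the hyperbolic variant is not covered.

```python
# Additive event-sequence encoder: the sequence representation is the vector
# sum of its event embeddings, so composition is literally addition.
import torch
import torch.nn as nn

class AdditiveEventEncoder(nn.Module):
    def __init__(self, n_event_types, dim=64):
        super().__init__()
        self.event_emb = nn.Embedding(n_event_types, dim)

    def forward(self, event_ids):
        # event_ids: (batch, seq_len); sequence code = sum of event vectors
        return self.event_emb(event_ids).sum(dim=1)

enc = AdditiveEventEncoder(n_event_types=100)
seq = torch.randint(0, 100, (8, 12))
codes = enc(seq)                                  # (8, 64)
# the code of a concatenated sequence equals the sum of its parts' codes
assert torch.allclose(enc(seq[:, :6]) + enc(seq[:, 6:]), codes, atol=1e-5)
```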
Vision-language models (VLMs) enable open-ended visual question answering but remain prone to hallucinations. We present HEDGE, a unified framework for hallucination detection that combines controlled visual perturbations, semantic clustering, and robust uncertainty metrics. HEDGE integrates sampling, distortion synthesis, clustering (entailment- and embedding-based), and metric computation into a reproducible pipeline applicable across multimodal architectures. Evaluations on VQA-RAD and KvasirVQA-x1 with three representative VLMs (LLaVA-Med, Med-Gemma, Qwen2.5-VL) reveal clear architecture- and prompt-dependent trends. Hallucination detectability is highest for unified-fusion models with dense visual tokenization (Qwen2.5-VL) and lowest for architectures with restricted tokenization (Med-Gemma). Embedding-based clustering often yields stronger separation when applied directly to the generated answers, whereas NLI-based clustering remains advantageous for LLaVA-Med and for longer, sentence-level responses. Across configurations, the VASE metric consistently provides the most robust hallucination signal, especially when paired with embedding clustering and a moderate sampling budget (n ~ 10-15). Prompt design also matters: concise, label-style outputs offer clearer semantic structure than syntactically constrained one-sentence responses. By framing hallucination detection as a geometric robustness problem shaped jointly by sampling scale, prompt structure, model architecture, and clustering strategy, HEDGE provides a principled, compute-aware foundation for evaluating multimodal reliability. The hedge-bench PyPI library enables reproducible and extensible benchmarking, with full code and experimental resources available at https://github.com/Simula/HEDGE .
Deep learning has achieved state-of-the-art video action recognition (VAR) performance by comprehending action-related features from raw video. However, these models often learn to jointly encode auxiliary view information (viewpoints and sensor properties) with primary action features, leading to performance degradation under novel views and to security concerns by revealing sensor types and locations. Here, we systematically study these shortcomings of VAR models and develop a novel approach, VIVAR, that learns view-invariant spatiotemporal action features by removing view information. In particular, we leverage contrastive learning to separate actions and jointly optimize an adversarial loss that aligns view distributions to remove auxiliary view information in the deep embedding space, using unlabeled synchronous multiview (MV) video to learn a view-invariant VAR system. We evaluate VIVAR using our in-house large-scale time-synchronous MV video dataset containing 10 actions with three angular viewpoints and sensors in diverse environments. VIVAR successfully captures view-invariant action features, improves the quality of inter- and intra-action clusters, and consistently outperforms SoTA models with 8% higher accuracy. We additionally perform extensive studies with our datasets, model architectures, multiple contrastive learning variants, and view distribution alignments to provide insights into VIVAR. We open-source our code and dataset to facilitate further research in view-invariant systems.
This paper presents a dual-modal instance segmentation framework based on Spatial Axial Band Attention (SABA) for electrical distribution scenarios. To address the challenges of edge ambiguity, thermal heterogeneity, and small-target omission in single-modal methods, we propose: 1) a dual-stream feature pyramid with non-isotropic band partitioning that decomposes global attention into orthogonal local band attention, and 2) a cross-scale guided aggregation mechanism to resolve intensity inhomogeneity caused by uneven heat dissipation. Experimental results on our electrical distribution dataset demonstrate leading performance with 62.0 mAP@0.5 and 56.3% mask IoU, surpassing single-modal baselines by absolute gains of 3.6-9.5%. The lightweight architecture also has lower complexity (79.68 GFLOPs) than several of the compared dual-modal methods.
Dynamic Contrast-Enhanced Magnetic Resonance Imaging (DCE-MRI) is a medical imaging technique that plays a crucial role in the detailed visualization and identification of tissue perfusion in abnormal lesions and in radiological suggestions for biopsy. However, DCE-MRI involves the administration of a Gadolinium-based (Gad) contrast agent, which carries a risk of toxicity in the body. Previous deep learning approaches that synthesize DCE-MR images employ unimodal non-contrast or low-dose contrast MRI images and lack focus on the local perfusion information within the anatomy of interest. We propose AAD-DCE, a generative adversarial network (GAN) with an aggregated attention discriminator module consisting of global and local discriminators. The discriminators provide a spatially embedded attention map to drive the generator to synthesize early- and late-response DCE-MRI images. Our method employs multimodal inputs - T2 weighted (T2W), Apparent Diffusion Coefficient (ADC), and T1 pre-contrast - for image synthesis. Extensive comparative and ablation studies on the ProstateX dataset show that our model (i) is agnostic to various generator benchmarks, (ii) outperforms other DCE-MRI synthesis approaches with improvement margins of +0.64 dB PSNR, +0.0518 SSIM, and -0.015 MAE for early response and +0.1 dB PSNR, +0.0424 SSIM, and -0.021 MAE for late response, and (iii) emphasizes the importance of attention ensembling. Our code is available at https://github.com/bhartidivya/AAD-DCE.
Short-term precipitation forecasting is an important part of modern weather prediction systems. Using richer observational data can improve the accuracy of rainfall predictions. Current methods often rely on single-modality data, such as radar echo data, which cannot fully capture the many factors affecting rainfall changes; this limits the accuracy and timeliness of predictions. To address this issue, we propose a novel Multimodal Data Attention Fusion Network (MAFN) that combines radar echo and wind speed data, leveraging their complementary strengths to improve prediction accuracy. Specifically, MAFN includes a dual-stream encoder to separately extract spatiotemporal and sequential features, effectively combining spatial information on rainfall with the dynamic processes of its movement. It also features an Attention Fusion Module (AFM) to align and merge the feature information, and a decoder to generate the rainfall map. Experiments on the ERA5 dataset show that MAFN outperforms existing models, achieving higher accuracy and robustness in precipitation forecasting.
Multimodal biometric authentication is a robust security mechanism designed to enhance the reliability and security of user-authentication systems. It integrates several biometric traits to provide an accurate and secure identification system. However, existing Deep Learning (DL) models struggle to capture both spatial and contextual dependencies across multiple biometric traits, which reduces their robustness against spoofing attacks. Hence, this paper proposes CSMG, a model combining a convolutional neural network, Swin Transformer, multi-head self-attention, and global max pooling, for effective feature extraction, strengthening both spatial and contextual representation and reducing redundancy across modalities. The classification head then uses the extracted feature map and a softmax layer to predict a person's identity. An effective fusion strategy is introduced to integrate fingerprint, iris, and ECG signals, utilizing their complementary strengths to mitigate spoofing attacks. The performance of the proposed CSMG method was evaluated on the IITD Iris, SOCOFing fingerprint, and HEARTPRINT ECG datasets. The experimental evaluation demonstrates that CSMG achieves a recognition accuracy of 99.90% for fingerprints, 99% for irises, and 99% for ECG, outperforming traditional models.
This manuscript proposes a Multimodal input Residual Encoder-Decoder Network (MultimodalInputRED-CNN) leveraging complementary information from CT images and iodine maps to enhance pediatric CT image denoising while preserving diagnostically critical details. The network has three key components: (1) a specialized dual-branch encoder that processes CT images with $5 \times 5$ convolution kernels for noise suppression and iodine maps with $3 \times 3$ kernels for detail preservation; (2) a cross-attention mechanism that computes channel and spatial attention weights to adaptively fuse features based on local image characteristics; and (3) a noise confidence assessment module that dynamically adjusts the influence of iodine map features according to noise conditions. The method was evaluated on pediatric lower limb CT images under various noise scenarios. Under Gaussian noise, the MultimodalInputRED-CNN achieved a PSNR of $33.79 \pm 1.31$ dB and an SSIM of $0.9560 \pm 0.0163$, outperforming NAFNet by 1.23 dB. For mixed noise, our LPIPS score of $0.0349 \pm 0.0037$ represented a 36.4% improvement over EDCNN. Ablation studies showed that removing the iodine map input resulted in a 1.76 dB PSNR drop. The MultimodalInputRED-CNN overcomes the limitations of traditional single-input methods when processing complex anatomical regions, providing a new technique for improving pediatric low-dose CT imaging with potential clinical value.
Sea surface height (SSH) is an important parameter in oceanographic studies and is crucial for understanding oceanic and atmospheric processes. Traditional physical models compress two-dimensional delay-Doppler map (DDM) data into a single scalar value, resulting in the loss of critical information and the need for error modeling and correction. First, utilizing the reflected signals from the BeiDou Navigation Satellite System (BDS) provided by the FY-3E GNSS Radio Occultation Sounder-II (GNOS-II), a novel multimodal deep learning model integrating self-attention mechanisms and residual networks, termed Vision Transformer Residual Network Multimodal Deep Learning (ViTResNetMDL), is proposed to retrieve sea surface height. ViTResNetMDL captures the global spatial features of the effective scattering area in the DDM through the self-attention mechanism of the Transformer module, extracts the local detail features of the original DDM using the residual structure, combines them with auxiliary parameters for multimodal data fusion, and finally inverts the SSH through four fully connected layers. Second, to validate the proposed improvements, the inversion results are corrected by cumulative distribution function (CDF) matching, and the DTU18 global sea surface height validation model, corrected with the DTU23 global ocean tide model, is used as a reference for extensive tests; the results show that the SSH inversions of the ViTResNetMDL model have a root mean square error (RMSE) of 0.74 m, a mean absolute error (MAE) of 0.48 m, and a coefficient of determination (R2) of 1.00. Third, compared with the inversion results based on the derivative polar tracking method, the RMSE, MAE, and R2 of the ViTResNetMDL inversions are improved by 80.9%, 80.1%, and 16.3%, respectively. The ViTResNetMDL model provides a new theoretical and methodological reference for GNSS-R altimetry inversion.
Disaster Response Mapping (DRM) integrates multisource imaging data so that maps of regions affected by calamities can be produced in real time, supporting sufficient and timely decision-making in emergency operations. Multimodal Image Fusion Using Guided Attention (MM-IFGA) is proposed to improve DRM by fusing optical, infrared, and Synthetic Aperture Radar (SAR) images into a single high-quality image. The multimodal inputs are first synchronized and preprocessed to ensure spatial coherence and reduce noise. Spatial, spectral, and structural features of the modalities are extracted with CNNs, and disaster-relevant regions are highlighted with a guided attention mechanism. The attention-enhanced features are then weighted, concatenated, and decoded into a high-resolution output map that retains structural and spectral integrity. The MM-IFGA framework also addresses the heterogeneity, noise, and redundancy of multimodal disaster information, allowing emergency teams to optimize resource allocation, intervention targeting, and recovery planning. Experimental comparisons show that the proposed method achieves higher image quality, with a PSNR of 33.1, better structural similarity, and better overall visual fidelity, thereby providing suitable and timely assessment of large-scale calamities. These findings demonstrate that MM-IFGA is a scalable and sound intelligent real-time approach to disaster mapping for events such as floods, wildfires, and structural damage investigations.
Gait recognition, as a long-distance and non-invasive biometric identification technology, has significant application value in public security and intelligent monitoring. However, viewpoint changes, clothing variations, and occlusion in real-world scenarios severely affect the robustness of traditional silhouette-based methods. Although skeleton-based methods are naturally robust to appearance changes, their recognition performance is limited by the accuracy of pose estimation, and existing methods lack dynamic adaptive capabilities in multimodal fusion. To this end, a multimodal gait recognition method named SkeletonGait-CPCA is proposed. The method first converts human joint coordinates into a structured skeleton heat map to construct the baseline model SkeletonGait; on this basis, a channel prior convolutional attention (CPCA) module is introduced, achieving adaptive fusion of silhouette and skeleton features through parallel channel and spatial attention mechanisms. Experiments on the real-world gait dataset Gait3D show that with the CPCA module, SkeletonGait-CPCA improves the rank-1, rank-5, and mAP metrics to 78.2%, 90.4%, and 72.4%, respectively, verifying the superiority and robustness of the proposed method in complex scenarios.
LiDAR, radar, and cameras are widely used in autonomous driving systems, but each modality has inherent limitations, such as LiDAR's sensitivity to adverse weather, radar's low spatial resolution, and cameras' dependence on lighting conditions. To address these challenges, this study proposes a novel multi-modal fusion framework that integrates these sensors to enhance object detection accuracy and robustness. A cross-modal feature alignment strategy ensures spatial and semantic consistency across sensor data, while an attention-based mechanism dynamically adjusts the contributions of each modality based on their reliability in different scenarios. Experimental results on the KITTI and nuScenes datasets show that the framework achieves a mean Average Precision (mAP) of 89.4% while maintaining real-time efficiency at 36.8 FPS. Compared to single-modality baselines and traditional fusion methods, the proposed framework demonstrates superior detection performance, particularly in scenarios involving occlusion, low-light conditions, and dense traffic. Ablation studies validate the effectiveness of the cross-modal alignment and attention mechanisms, highlighting their critical roles in achieving robust detection. The proposed framework offers a scalable and efficient solution for autonomous driving systems, effectively addressing the limitations of single-modality sensors.
To address the issue of low accuracy in object detection for autonomous driving, we propose an attention-enhanced multimodal fusion three-dimensional object detection method (EA-BEV). The method incorporates a self-attention mechanism in the image processing network, which effectively extracts deep features and mitigates the insufficient image feature extraction caused by blurred semantic information. In the point cloud processing network, we design a high-order convolutional spatial attention mechanism that significantly enhances the network's ability to model and express non-linear deep point cloud features, thereby improving the global descriptive capability of point cloud information. Comparative experiments on the nuScenes dataset show an mAP of 76.2% and an NDS of 74.4%, demonstrating a clear advantage of EA-BEV in 3D object detection precision.
In autonomous driving, trajectory prediction is essential for safe and efficient navigation. While recent methods often rely on high-definition (HD) maps to provide structured environmental priors, such maps are costly to maintain, geographically limited, and unreliable in dynamic or unmapped scenarios. Directly leveraging raw sensor data in Bird's-Eye View (BEV) space offers greater flexibility, but BEV features are dense and unstructured, making agent-centric spatial reasoning challenging and computationally inefficient. To address this, we propose Bird's-Eye View Trajectory Prediction (BEVTraj), a map-free framework that employs deformable attention to adaptively aggregate task-relevant context from sparse locations in dense BEV features. We further introduce a Sparse Goal Candidate Proposal (SGCP) module that predicts a small set of realistic goals, enabling fully end-to-end multimodal forecasting without heuristic post-processing. Extensive experiments show that BEVTraj achieves performance comparable to state-of-the-art HD map-based methods while providing greater robustness and flexibility without relying on pre-built maps. The source code is available at https://github.com/Kongminsang/bevtraj.
Trajectory prediction is a crucial task for autonomous driving, but current models’ reliance on high-definition (HD) maps limits their broader applicability. To cope with this challenge, we propose a novel map-free trajectory prediction method that leverages spatiotemporal attention mechanisms. The method consists of three key stages: 1) we first encode spatial and temporal features separately using spatial and temporal attention mechanisms, 2) we then model spatial and temporal interactions through Crystal Graph Convolutional Networks (CGCN) and Multi-Head Attention (MHA), 3) finally, an adaptive anchor generation technique is introduced to tackle the multimodal trajectory prediction challenge. This self-adaptive technique generates context-specific anchors, enabling accurate prediction of multiple possible future vehicle trajectories. Extensive experiments on the Argoverse1 and V2X-Seq datasets validate the effectiveness of our approach. On the Argoverse1 dataset, our method outperforms CRAT-Pred by 5.8% in minADE and 6.25% in minFDE. On the V2X-Seq dataset, it achieves improvements of 82.6%, 85.1%, and 44.0% in minADE, minFDE, and MR, respectively, compared to the baseline model.
Accurate segmentation of gliomas is crucial for diagnosis, treatment planning, and prognostic assessment. However, existing multimodal MRI segmentation methods are limited by inadequate information fusion, particularly when addressing significant tumor scale variations. To address these challenges, we present VAMF-Net, a V-Net-based architecture comprising three coordinated modules. AMF performs voxel-wise, modality-adaptive fusion via a spatial attention map, enabling the network to assign dynamic weights to each MRI sequence. MFF aggregates multiscale context by employing parallel 3D dilated convolutions and cross-stage feature fusion, effectively handling large-scale variations. ConBlock3D + 3D-CBAM refines representations with channel and spatial attention and residual connections to sharpen boundaries. On the BraTS 2019 test set (with the model trained on BraTS 2020), VAMF-Net outperforms several advanced baselines (mean Dice: 0.910, HD95: 3.03, ET boundary HD95: 1.80), and ablation studies highlight the complementary contributions of the three modules. This study provides an efficient solution for multimodal medical image segmentation, with strong potential for clinical application.
Human action recognition (HAR) systems are foundational for mobile educational technologies, such as gesture-based learning analytics and remote skill acquisition. However, current systems often fail in real-world settings due to visual occlusion and the neglect of the rich contextual information provided by the acoustic modality, particularly in visual-centric datasets such as NTU RGB+D 60 and MSR Daily Activity 3D. By manually producing action-relevant audio streams for these datasets, we propose a multimodal approach that fuses skeleton and audio modalities through a cross-attention mechanism. Our framework processes skeleton data by integrating joints and limbs into an H × W × 31 spatial feature map, which is then fed into a ResNet50 backbone. Log-Mel spectrograms are encoded using a ConvNeXt-T architecture. A cross-attention mechanism is employed to fuse these features, effectively learning inter-modal dependencies. Evaluations demonstrate significant gains: 94.7% on NTU RGB+D X-SUB (up from 90.5% using only skeleton data) and 97.9% on MSR Daily Activity 3D (compared to 89.8%). These results quantitatively establish the critical role of audio in enabling robust, real-time feedback loops that are essential for smart learning environments and interactive mobile coaching, where visual data alone is unreliable.
A multimodal adaptive action recognition method based on attention distillation is proposed to address interference from irrelevant information in multimodal feature extraction and insufficient modal complementarity and consistency in fusion. First, a dual-branch FPN processes RGB and depth-map data in parallel for multiscale feature extraction, while skeletal data undergoes skeleton graph modeling. A feature decoupling module and an attention distillation loss are designed and optimized via backpropagation, enhancing task-related feature extraction and reducing noise-induced redundancy. In fusion, a Transformer uses its Encoder-Decoder structure to adaptively adjust multimodal fusion weights. Finally, a spatiotemporal graph convolutional network performs spatiotemporal modeling and classification on the enhanced skeleton graph features, capturing the spatial correlations of human joints and the temporal dynamics of movement to improve classification accuracy. Experiments on the NTU RGB-D60 and NTU RGB-D120 datasets show the method outperforms benchmark algorithms, verifying its effectiveness in action recognition.
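One common way to realize an attention distillation term, shown here only as an illustration of the mechanism rather than the paper's exact loss, is to match normalized spatial attention maps between a teacher and a student branch with a KL divergence.

```python
# Generic attention distillation: derive a spatial attention map from each
# branch's feature map by channel-wise energy, normalize into distributions,
# and penalize their KL divergence.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_feat, teacher_feat, temperature=1.0):
    # feats: (batch, channels, H, W)
    s_att = student_feat.pow(2).mean(dim=1).flatten(1)     # (batch, H*W)
    t_att = teacher_feat.pow(2).mean(dim=1).flatten(1)
    s_log = F.log_softmax(s_att / temperature, dim=1)
    t_prob = F.softmax(t_att / temperature, dim=1)
    return F.kl_div(s_log, t_prob, reduction="batchmean")

student = torch.randn(4, 64, 14, 14, requires_grad=True)
teacher = torch.randn(4, 64, 14, 14)
loss = attention_distillation_loss(student, teacher)
loss.backward()
```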
Accurate environmental perception is fundamental to safe autonomous driving; however, most existing multimodal systems rely on fixed or heuristic sensor fusion strategies that cannot adapt to scene-dependent variations in sensor reliability. This paper proposes Cross-Modal Adaptive Attention (CMAA), a unified end-to-end Bird’s-Eye-View (BEV) perception framework that dynamically fuses camera, LiDAR, and RADAR information through learnable, context-aware modality gating. Unlike static fusion approaches, CMAA adaptively reweights sensor contributions based on global scene descriptors, enabling the robust integration of semantic, geometric, and motion cues without manual tuning. The proposed architecture jointly performs 3D object detection, multi-object tracking, and motion forecasting within a shared BEV representation, preserving spatial alignment across tasks and supporting efficient real-time deployment. Experiments conducted on the official nuScenes validation split demonstrate that CMAA achieves 0.528 mAP and 0.691 NDS, outperforming fixed-weight fusion baselines while maintaining a compact model size and efficient inference. Additional tracking evaluation using the official nuScenes tracking devkit reports improved tracking performance, while motion forecasting experiments show reduced trajectory displacement errors (minADE and minFDE). Ablation studies further confirm the complementary contributions of adaptive modality gating and bidirectional cross-modal refinement, and a stratified dynamic analysis reveals consistent reductions in velocity estimation error across object classes, motion regimes, and environmental conditions. These results demonstrate that adaptive multimodal fusion improves robustness, motion reasoning, and perception reliability in complex traffic environments while remaining computationally efficient for deployment in safety-critical autonomous driving systems.
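A minimal sketch of context-aware modality gating: a global scene descriptor produces per-modality weights that rescale camera, LiDAR, and RADAR BEV features before fusion. The descriptor, dimensions, and softmax gating are illustrative assumptions, not CMAA's exact design.

```python
# Scene-conditioned modality gating over BEV feature maps: pool each modality
# globally, predict softmax weights from the concatenated descriptor, and
# reweight the modalities before additive fusion.
import torch
import torch.nn as nn

class ModalityGate(nn.Module):
    def __init__(self, dim=128, n_modalities=3):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * n_modalities, 64), nn.ReLU(),
            nn.Linear(64, n_modalities)
        )

    def forward(self, bev_feats):
        # bev_feats: list of (batch, dim, H, W) maps, one per modality
        desc = torch.cat([f.mean(dim=(2, 3)) for f in bev_feats], dim=1)
        weights = torch.softmax(self.gate(desc), dim=1)          # (batch, n_mod)
        fused = sum(w.view(-1, 1, 1, 1) * f
                    for w, f in zip(weights.unbind(dim=1), bev_feats))
        return fused, weights

cam, lidar, radar = (torch.randn(2, 128, 32, 32) for _ in range(3))
fused, w = ModalityGate()([cam, lidar, radar])
```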
Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning capabilities but remain highly susceptible to object hallucination, where models generate responses that are not factually aligned with the visual content. Recent works attribute this issue to an inherent bias of LVLMs in which the vision-token attention map has a fixed correlation with spatial position, and propose to mitigate it by reordering visual tokens. However, we find that different LVLMs exhibit different correlations between attention and spatial position, which makes the existing solution difficult to generalize to other LVLMs. To address this, we first introduce a training-free solution, Uniform Attention Calibration (UAC), which estimates the bias from a single meaningless input image and applies a calibration matrix to rectify attention imbalances. To further alleviate the bias, we relax UAC's single-meaningless-input assumption and introduce a fine-tuning solution, Dynamic Attention Calibration (DAC), which enforces consistent outputs wherever the object is located in the image via a plug-and-play module. Comprehensive experiments across multiple benchmarks demonstrate that UAC and DAC significantly reduce object hallucination while improving general multimodal alignment. Our methods achieve state-of-the-art performance across diverse LVLM architectures on various metrics.
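The bias-then-calibrate idea can be sketched as follows; the exact calibration matrix used by UAC is not reproduced here, so this per-position rescaling of vision-token attention is only an illustrative assumption.

```python
# Estimate a positional attention bias from a meaningless (e.g., all-grey)
# image, then divide real attention over vision tokens by that profile and
# renormalize, so no position is favored purely by its location.
import torch

def estimate_position_bias(attn_meaningless):
    # attn_meaningless: (layers, heads, n_vision_tokens) attention mass that
    # the meaningless image's tokens receive; average away layers and heads
    return attn_meaningless.mean(dim=(0, 1)) + 1e-8          # (n_vision_tokens,)

def calibrate(attn, bias):
    # attn: (..., n_vision_tokens) attention over vision tokens, real image
    rectified = attn / bias
    return rectified / rectified.sum(dim=-1, keepdim=True)

bias = estimate_position_bias(torch.rand(32, 16, 576))
real_attn = torch.softmax(torch.randn(4, 576), dim=-1)
calibrated = calibrate(real_attn, bias)
```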
The rapid evolution of deep learning has dramatically enhanced the field of medical image segmentation, leading to the development of models with unprecedented accuracy in analyzing complex medical images. Deep learning-based segmentation holds significant promise for advancing clinical care and enhancing the precision of medical interventions. However, these models’ high computational demand and complexity present significant barriers to their application in resource-constrained clinical settings. To address this challenge, we introduce Teach-Former, a novel knowledge distillation (KD) framework that leverages a Transformer backbone to effectively condense the knowledge of multiple teacher models into a single, streamlined student model. Moreover, it excels in the contextual and spatial interpretation of relationships across multimodal images for more accurate and precise segmentation. Teach-Former stands out by harnessing multimodal inputs (CT, PET, MRI) and distilling the final predictions and the intermediate attention maps, ensuring a richer spatial and contextual knowledge transfer. Through this technique, the student model inherits the capacity for fine segmentation while operating with a significantly reduced parameter set and computational footprint. Additionally, introducing a novel training strategy optimizes knowledge transfer, ensuring the student model captures the intricate mapping of features essential for high-fidelity segmentation. The efficacy of Teach-Former has been effectively tested on two extensive multimodal datasets, HECKTOR21 and PI-CAI22, encompassing various image types. The results demonstrate that our KD strategy reduces the model complexity and surpasses existing state-of-the-art methods to achieve superior performance. The findings of this study indicate that the proposed methodology could facilitate efficient segmentation of complex multimodal medical images, supporting clinicians in achieving more precise diagnoses and comprehensive monitoring of pathological conditions (https://github.com/FarihaHossain/TeachFormer).
Recent advancements in text-to-video (T2V) diffusion models have significantly enhanced the visual quality of the generated videos. However, even recent T2V models find it challenging to follow text descriptions accurately, especially when the prompt requires accurate control of spatial layouts or object trajectories. A recent line of research uses layout guidance for T2V models that require fine-tuning or iterative manipulation of the attention map during inference time. This significantly increases the memory requirement, making it difficult to adopt a large T2V model as a backbone. To address this, we introduce Video-MSG, a training-free Guidance method for T2V generation based on Multimodal planning and Structured noise initialization. Video-MSG consists of three steps, where in the first two steps, Video-MSG creates Video Sketch, a fine-grained spatio-temporal plan for the final video, specifying background, foreground, and object trajectories, in the form of draft video frames. In the last step, Video-MSG guides a downstream T2V diffusion model with Video Sketch through noise inversion and denoising. Notably, Video-MSG does not need fine-tuning or attention manipulation with additional memory during inference time, making it easier to adopt large T2V models. Video-MSG demonstrates its effectiveness in enhancing text alignment with multiple T2V backbones (VideoCrafter2 and CogVideoX-5B) on popular T2V generation benchmarks (T2VCompBench and VBench). We provide comprehensive ablation studies about noise inversion ratio, different background generators, background object detection, and foreground object segmentation.
Accurate road network extraction from remote sensing images (RSIs) is essential for applications such as urban planning, map updates, and autonomous navigation. However, challenges such as complex backgrounds, varying spatial resolutions, and occlusions hinder traditional single-modality approaches, which often fail to capture comprehensive contextual information. To address these limitations, we propose the DEFNet, a novel dual-layer evidential fusion network for robust multimodal road extraction. The DEFNet features two key modules: cross-attention feature interaction (CAFI) and dual-layer evidential fusion (DEF). The CAFI module facilitates adaptive multimodal interaction at both pixel and superpixel levels, enhancing feature fusion while mitigating noise. The DEF module, leveraging the Dirichlet framework and Dempster–Shafer theory (DST), performs uncertainty-aware fusion, improving prediction reliability and robustness. Extensive experiments on multiple benchmark datasets demonstrate that the DEFNet consistently outperforms state-of-the-art methods in both accuracy and robustness, making it highly effective for multimodal road extraction in remote sensing applications. The codes can be downloaded from https://github.com/BeechburgPieStar/DEFNet
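For reference, the reduced Dempster-Shafer combination of two Dirichlet-based opinions commonly used in evidential fusion can be sketched as below; whether DEFNet applies exactly this reduced rule is an assumption, the sketch only illustrates the general mechanism of uncertainty-aware fusion.

```python
# Dempster-Shafer combination of two Dirichlet-based opinions: per-modality
# evidence defines class belief masses plus an uncertainty mass, which are
# combined with Dempster's rule (masses always sum to 1).
import numpy as np

def opinion_from_evidence(evidence):
    # evidence: non-negative per-class evidence, shape (K,)
    K = evidence.shape[0]
    alpha = evidence + 1.0                 # Dirichlet parameters
    S = alpha.sum()
    belief = evidence / S                  # per-class belief masses
    uncertainty = K / S                    # leftover mass on "don't know"
    return belief, uncertainty

def dempster_combine(b1, u1, b2, u2):
    conflict = np.sum(np.outer(b1, b2)) - np.sum(b1 * b2)   # mass on disagreeing pairs
    scale = 1.0 / (1.0 - conflict)
    belief = scale * (b1 * b2 + b1 * u2 + b2 * u1)
    uncertainty = scale * u1 * u2
    return belief, uncertainty

b_opt, u_opt = opinion_from_evidence(np.array([4.0, 1.0]))   # e.g., optical branch
b_sar, u_sar = opinion_from_evidence(np.array([2.0, 2.0]))   # e.g., SAR branch
b, u = dempster_combine(b_opt, u_opt, b_sar, u_sar)
print(b, u, b.sum() + u)                                      # combined masses sum to 1
```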
Multimodal fusion technology significantly enhances the safety and perception capabilities of intelligent vehicles. Recently, replacing Cartesian coordinate system voxels with polar voxels in 3D perception tasks has significantly improved spatial occupancy rates and adaptability. However, the uneven distribution of voxels introduces new challenges: feature information distortion and reduced real-time performance. This paper proposes a multimodal fusion network based on polar graphs to address these issues. Raw data from LiDAR, cameras, and millimeter-wave (MMW) radar are initially preprocessed, and point-graph and voxel-graph structures in polar coordinates are constructed. Subsequently, using Graph Attention Networks (GAT), features are extracted and aggregated at multiple levels, forming a polar-based Bird's Eye View (BEV) feature map. At the BEV level, multimodal features are fused, and multi-scale features are aggregated using multi-scale GAT, culminating in the design of a polar-based CenterHead to complete the 3D perception task. Extensive experiments conducted on the nuScenes dataset and real vehicle test data have demonstrated that the model's detection precision (70.5% mAP) and inference speed (12.6 Hz) surpass those of comparative models, establishing a new state-of-the-art (SOTA). Additionally, the model exhibits high levels of perception accuracy, robustness, and generalizability across various real vehicle scenarios.
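The following small sketch illustrates, under assumed ranges and resolutions, how Cartesian LiDAR points can be assigned to polar (range, azimuth, height) voxel indices; it is a generic illustration of polar voxelization, not this paper's configuration.

```python
# Illustration of assigning Cartesian LiDAR points to polar (range, azimuth, z) voxels.
# Ranges and resolutions are illustrative assumptions, not the paper's configuration.
import numpy as np

def polar_voxel_indices(points, r_max=50.0, n_r=100, n_theta=180,
                        z_min=-3.0, z_max=3.0, n_z=10):
    """points: (N, 3) array of x, y, z coordinates in meters."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)                              # radial distance
    theta = np.arctan2(y, x)                        # azimuth in [-pi, pi)

    r_idx = np.clip((r / r_max * n_r).astype(int), 0, n_r - 1)
    t_idx = np.clip(((theta + np.pi) / (2 * np.pi) * n_theta).astype(int), 0, n_theta - 1)
    z_idx = np.clip(((z - z_min) / (z_max - z_min) * n_z).astype(int), 0, n_z - 1)
    return np.stack([r_idx, t_idx, z_idx], axis=1)

pts = np.random.uniform(-40, 40, size=(1000, 3))
voxels = polar_voxel_indices(pts)
```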
Effectively segmenting brain tumors from multimodal 3D MRI scans presents a formidable challenge due to the heterogeneous nature of tumor structures and the inherent modality imbalance. This study presents the Enhanced Region-Aware Fusion Network, an end-to-end architecture that integrates spatial-probabilistic reasoning with adaptive modality fusion for volumetric medical image segmentation. The Probability Map Estimator is the fundamental module of this study. It generates region confidence maps, guiding an Enhanced Region-Aware Fusion Module to learn dynamic attention weights across various MRI modalities. The fused representations are then refined by a shared decoder to enable precise tumor delineation. The Enhanced Region-Aware Fusion Network simultaneously minimizes both Binary Cross-Entropy and Dice losses, thereby enhancing sensitivity at tumor boundaries and effectively addressing label imbalance. We evaluated our model on the BraTS2020 dataset. Results show that the Enhanced Region-Aware Fusion Network surpasses conventional multimodal fusion approaches while maintaining spatial consistency within tumor regions, highlighting its promise for clinical MRI tumor diagnosis.
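A minimal sketch of the joint Binary Cross-Entropy plus Dice objective described above; the 0.5/0.5 weighting and the single-channel volumetric shapes are assumptions for illustration.

```python
# Sketch of a joint BCE + Dice loss for volumetric binary segmentation.
import torch
import torch.nn.functional as F

def bce_dice_loss(logits, target, bce_weight=0.5, eps=1e-6):
    """logits, target: (B, 1, D, H, W) volumetric prediction and binary ground truth."""
    bce = F.binary_cross_entropy_with_logits(logits, target)

    prob = torch.sigmoid(logits)
    inter = (prob * target).sum(dim=(1, 2, 3, 4))
    union = prob.sum(dim=(1, 2, 3, 4)) + target.sum(dim=(1, 2, 3, 4))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)

    return bce_weight * bce + (1 - bce_weight) * dice.mean()
```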
In multi-energy systems, the continuous expansion of sensor networks and the rapid growth of data dimensions make traditional single-modality anomaly detection strategies difficult to adapt to the multi-source heterogeneity and latent correlations found in complex energy consumption scenarios. Building on multimodal fusion, the Correlation-Driven Multi-Level Multimodal model in this work performs cross-modal collaborative learning and multi-level semantic representation over the features of multiple energy sources. First, a multi-channel attention mechanism mines the correlations between energy consumption data in different modalities and weights the key features accordingly. Then, a hierarchical embedding network maps the multimodal features into a high-dimensional unified semantic space, capturing temporal information and spatial relations at different levels. Moreover, by combining graph structure learning with adversarial training, we construct a composite graph that covers the multi-source data and their latent interactions, enhancing the accuracy and robustness of abnormal-feature capture and enabling accurate detection of complex anomalous patterns in multi-energy systems. The experimental results demonstrate that, compared to previous mainstream methods, the proposed model achieves higher detection accuracy and stronger generalization on multi-source energy datasets.
No abstract available
This study addresses the challenges of feature representation and computational complexity in small facial acne detection by proposing a multi-modal knowledge distillation method. An enhanced YOLOv8s-based teacher model is proposed, integrating a transformer architecture into the backbone to improve global feature capture and a novel multi-scale attention mechanism in the neck network to enhance small acne feature representation. The student model is a lightweight version of the teacher model, obtained by replacing standard convolution with depthwise separable convolution. Additionally, a multi-modal distillation approach leveraging spatial, channel, and correlation information is proposed to overcome limitations of single-information distillation, enabling efficient training of student models. Experiments show that the proposed method achieves mAP of 31.52% on ACNE4K and 24.1% on ACNE04, with 7.83M parameters and 61.06G FLOPs, reducing parameters by 45% and computational costs by 50% compared to the teacher model while maintaining comparable accuracy.
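As an illustration of the lightweight-student construction described above, the following PyTorch sketch replaces a standard convolution with a depthwise separable block; the channel sizes and the activation/normalization choices are assumptions, not the paper's exact module.

```python
# Sketch of swapping a standard convolution for a depthwise separable one, the kind of
# substitution used to derive a lightweight student; channel sizes are illustrative.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise: one filter per input channel (groups=in_ch).
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size, stride,
                                   padding=kernel_size // 2, groups=in_ch, bias=False)
        # Pointwise: 1x1 convolution mixes channels.
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

x = torch.randn(1, 64, 80, 80)
print(DepthwiseSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```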
The detection of micro-inclusions and the representation of interpretable results compatible with gemological standards are two major bottlenecks to the automation of diamond clarity grading. In this work, we propose a novel solution using an enhanced YOLOv7 model in a multimodal framework to overcome the above bottlenecks. Specifically, our contributions are threefold: Firstly, we improve the original YOLOv7 by incorporating the Efficient Channel Attention (ECA) mechanism to enhance the extracted fine-grained features and the Adaptively Spatial Feature Fusion (ASFF) module for capturing more robust multi-scale representations; secondly, we build a three-channel input consisting of optical grayscale, gray-level co-occurrence matrix (GLCM) texture linked with the optical properties of inclusions, and morphological operation-enhanced images; thirdly, we design a traceable grading system by combining the XGBoost classifier with programmable GIA (Gemological Institute of America) rules. Our method achieves 91.3% mAP@0.5 on the Roboflow Diamond Inclusion dataset, outperforming the baseline YOLOv7 by 9.2%. The clarity grading performance on this dataset attains an accuracy of 86.7%, a Kappa coefficient of 0.82, and a weighted F1-score of 0.87, resulting in high consistency with human expert evaluations. Ablation experiments confirm that the proposed components all make individual and complementary contributions. This work represents a significant advance towards the automatic, accurate, and interpretable grading of diamonds, and it creates a practical tool for use within the jewelry industry.
Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose Head Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy for stronger spatial concentration and gradient sensitivity for predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage subsequently provided to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and providing a simple yet effective strategy for enhancing precision in MLLMs.
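A minimal sketch of the two ingredients named above that are easiest to make concrete: ranking heads by spatial entropy and cropping around the peak of the fused guidance map. The head count, crop fraction, and the simple top-k fusion are assumptions; HAVC's actual head selection and gradient-sensitivity scoring are more involved.

```python
# Sketch: rank attention heads by spatial entropy, fuse the most focused ones, and
# derive a crop box around the peak of the resulting guidance map.
import torch

def spatial_entropy(attn_map, eps=1e-8):
    """attn_map: (H, W) non-negative attention over image patches; lower = more focused."""
    p = attn_map.flatten()
    p = p / (p.sum() + eps)
    return -(p * (p + eps).log()).sum()

def crop_box_from_heads(head_maps, top_k=4, crop_frac=0.4):
    """head_maps: (num_heads, H, W). Returns (row, col, height, width) in patch coords."""
    entropies = torch.stack([spatial_entropy(m) for m in head_maps])
    keep = entropies.argsort()[:top_k]            # lowest-entropy (most concentrated) heads
    guidance = head_maps[keep].mean(dim=0)

    h, w = guidance.shape
    peak = guidance.flatten().argmax()
    r, c = divmod(peak.item(), w)
    ch, cw = int(h * crop_frac), int(w * crop_frac)
    r0 = max(0, min(r - ch // 2, h - ch))
    c0 = max(0, min(c - cw // 2, w - cw))
    return r0, c0, ch, cw

box = crop_box_from_heads(torch.rand(16, 24, 24))
```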
Large multimodal models (LMMs) excel in scene understanding but struggle with fine-grained spatiotemporal reasoning due to weak alignment between linguistic and visual representations. Existing methods map textual positions and durations into the visual space encoded from frame-based videos, but suffer from temporal sparsity that limits language-vision temporal coordination. To address this issue, we introduce LLaFEA (Large Language and Frame-Event Assistant) to leverage event cameras for temporally dense perception and frame-event fusion. Our approach employs a cross-attention mechanism to integrate complementary spatial and temporal features, followed by self-attention matching for global spatio-temporal associations. We further embed textual position and duration tokens into the fused visual space to enhance fine-grained alignment. This unified framework ensures robust spatio-temporal coordinate alignment, enabling LMMs to interpret scenes at any position and any time. In addition, we construct a dataset of real-world frames-events with coordinate instructions and conduct extensive experiments to validate the effectiveness of the proposed method.
Despite recent advances in diffusion models, top-tier text-to-image (T2I) models still struggle to achieve precise spatial layout control, i.e. accurately generating entities with specified attributes and locations. Segmentation-mask-to-image (S2I) generation has emerged as a promising solution by incorporating pixel-level spatial guidance and regional text prompts. However, existing S2I methods fail to simultaneously ensure semantic consistency and shape consistency. To address these challenges, we propose Seg2Any, a novel S2I framework built upon advanced multimodal diffusion transformers (e.g. FLUX). First, to achieve both semantic and shape consistency, we decouple segmentation mask conditions into regional semantic and high-frequency shape components. The regional semantic condition is introduced by a Semantic Alignment Attention Mask, ensuring that generated entities adhere to their assigned text prompts. The high-frequency shape condition, representing entity boundaries, is encoded as an Entity Contour Map and then introduced as an additional modality via multi-modal attention to guide image spatial structure. Second, to prevent attribute leakage across entities in multi-entity scenarios, we introduce an Attribute Isolation Attention Mask mechanism, which constrains each entity's image tokens to attend exclusively to themselves during image self-attention. To support open-set S2I generation, we construct SACap-1M, a large-scale dataset containing 1 million images with 5.9 million segmented entities and detailed regional captions, along with a SACap-Eval benchmark for comprehensive S2I evaluation. Extensive experiments demonstrate that Seg2Any achieves state-of-the-art performance on both open-set and closed-set S2I benchmarks, particularly in fine-grained spatial and attribute control of entities.
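One way to read the Attribute Isolation Attention Mask is as a block mask over image self-attention in which tokens of each entity attend only to tokens of the same entity. The sketch below builds such a mask from a per-token entity assignment; the token-to-entity mapping and the handling of background tokens are assumptions, not Seg2Any's exact implementation.

```python
# Sketch of an attribute-isolation mask: tokens of each entity attend only to the same
# entity during image self-attention; background tokens (-1) are left unrestricted here.
import torch

def attribute_isolation_mask(entity_ids):
    """entity_ids: (N,) entity index per image token, -1 for background.
    Returns an (N, N) boolean mask where True means attention is allowed."""
    same_entity = entity_ids.unsqueeze(0) == entity_ids.unsqueeze(1)
    background = (entity_ids == -1)
    unrestricted = background.unsqueeze(0) | background.unsqueeze(1)
    return same_entity | unrestricted

ids = torch.tensor([0, 0, 1, 1, -1, 2])
mask = attribute_isolation_mask(ids)   # pass as a boolean attn mask, or convert to -inf bias
```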
Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
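A minimal sketch of the alignment regularizer idea: a cosine-similarity loss between the MLLM's hidden states at visual token positions and frozen vision foundation model features, through a trainable projection. The layer choice, projection, and dimensions are assumptions for illustration, not VIRAL's exact setup.

```python
# Sketch of a visual representation alignment regularizer between MLLM visual token
# states and frozen VFM patch features.
import torch
import torch.nn.functional as F

def visual_alignment_loss(mllm_visual_states, vfm_features, proj):
    """mllm_visual_states: (B, N, d_llm) hidden states at the visual token positions.
    vfm_features:        (B, N, d_vfm) patch features from a frozen vision foundation model.
    proj: nn.Linear(d_llm, d_vfm) trainable projection."""
    pred = F.normalize(proj(mllm_visual_states), dim=-1)
    target = F.normalize(vfm_features.detach(), dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()

proj = torch.nn.Linear(4096, 1024)
loss = visual_alignment_loss(torch.randn(2, 196, 4096), torch.randn(2, 196, 1024), proj)
```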
Large vision-language models (VLMs) show strong multimodal understanding but still struggle with 3D spatial reasoning, such as distance estimation, size comparison, and cross-view consistency. Existing 3D-aware methods either depend on auxiliary 3D information or enhance RGB-only VLMs with geometry encoders through shallow feature fusion. We propose SpaceMind, a multimodal large language model explicitly designed for spatial reasoning solely from RGB inputs. The model adopts a dual-encoder architecture, integrating VGGT as a spatial understanding encoder and InternViT as a 2D visual encoder. The key idea is to treat the camera representation as an active guiding modality rather than passive metadata. Specifically, SpaceMind introduces a lightweight Camera-Guided Modality Fusion module before the language model to replace shallow fusion. It applies camera-conditioned biasing to spatial tokens, assigns query-independent weights reflecting their geometric importance, and uses the camera embedding to gate the fused representation. Empirically, SpaceMind establishes new state-of-the-art results on VSI-Bench, SQA3D and SPBench, surpassing both open and proprietary systems on VSI-Bench and SPBench by large margins and achieving state-of-the-art performance on SQA3D. These results demonstrate that camera-guided modality fusion is an effective and practical inductive bias for equipping VLMs with genuinely spatially grounded intelligence. We will release code and model checkpoints to support future research.
Recent Large Multimodal Models have demonstrated remarkable reasoning capabilities, especially in solving complex mathematical problems and realizing accurate spatial perception. Our key insight is that these emerging abilities can naturally extend to robotic manipulation by enabling LMMs to directly infer the next goal in language via reasoning, rather than relying on a separate action head. However, this paradigm meets two main challenges: i) How to make LMMs understand the spatial action space, and ii) How to fully exploit the reasoning capacity of LMMs in solving these tasks. To tackle the former challenge, we propose a novel task formulation, which inputs the current states of object parts and the gripper, and reformulates rotation by a new axis representation instead of traditional Euler angles. This representation is more compatible with spatial reasoning and easier to interpret within a unified language space. For the latter challenge, we design a pipeline to utilize cutting-edge LMMs to generate a small but high-quality reasoning dataset of multi-round dialogues that successfully solve manipulation tasks for supervised fine-tuning. Then, we perform reinforcement learning by trial-and-error interactions in simulation to further enhance the model's reasoning abilities for robotic manipulation. Our resulting reasoning model built upon a 7B backbone, named ReasonManip, demonstrates three notable advantages driven by its system-2 level reasoning capabilities: i) exceptional generalizability to out-of-distribution environments, objects, and tasks; ii) inherent sim-to-real transfer ability enabled by the unified language representation shared across domains; iii) transparent interpretability connecting high-level reasoning and low-level control. Extensive experiments demonstrate the effectiveness of the proposed paradigm and its potential to advance LMM-driven robotic manipulation.
Effective human-agent collaboration in physical environments requires understanding not only what to act upon, but also where the actionable elements are and how to interact with them. Existing approaches often operate at the object level or disjointedly handle fine-grained affordance reasoning, lacking coherent, instruction-driven grounding and reasoning. In this work, we introduce a new task: Fine-grained 3D Embodied Reasoning, which requires an agent to predict, for each referenced affordance element in a 3D scene, a structured triplet comprising its spatial location, motion type, and motion axis, based on a task instruction. To solve this task, we propose AffordBot, a novel framework that integrates Multimodal Large Language Models (MLLMs) with a tailored chain-of-thought (CoT) reasoning paradigm. To bridge the gap between 3D input and 2D-compatible MLLMs, we render surround-view images of the scene and project 3D element candidates into these views, forming a rich visual representation aligned with the scene geometry. Our CoT pipeline begins with an active perception stage, prompting the MLLM to select the most informative viewpoint based on the instruction, before proceeding with step-by-step reasoning to localize affordance elements and infer plausible interaction motions. Evaluated on the SceneFun3D dataset, AffordBot achieves state-of-the-art performance, demonstrating strong generalization and physically grounded reasoning with only 3D point cloud input and MLLMs.
Reasoning Like Experts: Leveraging Multimodal Large Language Models for Drawing-based Psychoanalysis
Multimodal Large Language Models (MLLMs) have demonstrated exceptional performance across various objective multimodal perception tasks, yet their application to subjective, emotionally nuanced domains, such as psychological analysis, remains largely unexplored. In this paper, we introduce PICK, a multi-step framework designed for Psychoanalytical Image Comprehension through hierarchical analysis and Knowledge injection with MLLMs, specifically focusing on the House-Tree-Person (HTP) Test, a psychological assessment. First, we decompose drawings containing multiple instances into semantically meaningful sub-drawings, constructing a hierarchical representation that captures spatial structure and content across three levels: single-object level, multi-object level, and whole level. Next, we analyze these sub-drawings at each level with a targeted focus, extracting psychological or emotional insights from their visual cues. We also introduce an HTP knowledge base and design a feature extraction module, trained with reinforcement learning, to generate a psychological profile for single-object level analysis. This profile captures both holistic stylistic features and dynamic object-specific features (such as those of the house, tree, or person), correlating them with psychological states. Finally, we integrate this multi-faceted information to produce a well-informed assessment that aligns with expert-level reasoning. Our approach bridges the gap between MLLMs and specialized expert domains, offering a structured and interpretable framework for understanding human mental states through visual expression. Experimental results demonstrate that the proposed PICK significantly enhances the capability of MLLMs in psychological analysis. It is further validated as a general framework through extensions to emotion understanding tasks. Codes are released at https://github.com/YanbeiJiang/PICK.
Reasoning Segmentation (RS) is a multimodal vision-text task that requires segmenting objects based on implicit text queries, demanding both precise visual perception and vision-text reasoning capabilities. Current RS approaches rely on fine-tuning vision-language models (VLMs) for both perception and reasoning, but their tokenization of images fundamentally disrupts continuous spatial relationships between objects. We introduce DTwinSeger, a novel RS approach that leverages Digital Twin (DT) representation as an intermediate layer to decouple perception from reasoning. Innovatively, DTwinSeger reformulates RS as a two-stage process, where the first stage transforms the image into a structured DT representation that preserves spatial relationships and semantic properties, and the second employs a Large Language Model (LLM) to perform explicit reasoning over this representation to identify target objects. We propose a supervised fine-tuning method specifically for LLM with DT representation, together with a corresponding fine-tuning dataset Seg-DT, to enhance the LLM's reasoning capabilities with DT representations. Experiments show that our method can achieve state-of-the-art performance on two image RS benchmarks and three image referring segmentation benchmarks. These results indicate that the DT representation functions as an effective bridge between vision and text, enabling complex multimodal reasoning tasks to be accomplished solely with an LLM.
Foundation models have ushered in a new era for multimodal video understanding by enabling the extraction of rich spatiotemporal and semantic representations. In this work, we introduce a novel graph-based framework that integrates vision-language foundation models, leveraging VideoMAE for dynamic visual encoding and BERT for contextual text representation, to address the challenge of recognizing fine-grained bimanual manipulation actions. Departing from conventional static graph architectures, our approach constructs an adaptive multimodal graph where nodes represent frames, objects, and textual annotations, and edges encode spatial, temporal, and semantic relationships. These graph structures evolve dynamically based on learned interactions, allowing for flexible and context-aware reasoning. A task-specific attention mechanism within a Graph Attention Network further enhances this reasoning by modulating edge importance based on action semantics. Through extensive evaluations on diverse benchmark datasets, we demonstrate that our method consistently outperforms state-of-the-art baselines, underscoring the effectiveness of combining foundation models with dynamic graph-based reasoning for robust and generalizable action recognition.
Recent advances in large vision-language models (VLMs) have shown significant promise for 3D scene understanding. Existing VLM-based approaches typically align 3D scene features with the VLM's embedding space. However, this implicit alignment often yields suboptimal performance due to the scarcity of 3D data and the inherent complexity of spatial relationships in 3D environments. To address these limitations, we propose a novel hierarchical multimodal representation for 3D scene reasoning that explicitly aligns with VLMs at the input space by leveraging both multi-view images and text descriptions. The text descriptions capture spatial relationships by referencing the 3D coordinates of detected objects, while the multi-view images include a top-down perspective and four directional views (forward, left, right, and backward), ensuring comprehensive scene coverage. Additionally, we introduce a hierarchical feature representation that aggregates patch-level image features into view-level and scene-level representations, enabling the model to reason over both local and global scene context. Experimental results on both situated 3D Q&A and general 3D Q&A benchmarks demonstrate the effectiveness of our approach.
Accurate prediction of communication link quality metrics is essential for vehicle-to-infrastructure (V2I) systems, enabling smooth handovers, efficient beam management, and reliable low-latency communication. The increasing availability of sensor data from modern vehicles motivates the use of multimodal large language models (MLLMs) because of their adaptability across tasks and reasoning capabilities. However, MLLMs inherently lack three-dimensional spatial understanding. To overcome this limitation, a lightweight, plug-and-play bird's-eye view (BEV) injection connector is proposed. In this framework, a BEV of the environment is constructed by collecting sensing data from neighboring vehicles. This BEV representation is then fused with the ego vehicle's input to provide spatial context for the large language model. To support realistic multimodal learning, a co-simulation environment combining CARLA simulator and MATLAB-based ray tracing is developed to generate RGB, LiDAR, GPS, and wireless signal data across varied scenarios. Instructions and ground-truth responses are programmatically extracted from the ray-tracing outputs. Extensive experiments are conducted across three V2I link prediction tasks: line-of-sight (LoS) versus non-line-of-sight (NLoS) classification, link availability, and blockage prediction. Simulation results show that the proposed BEV injection framework consistently improved performance across all tasks. The results indicate that, compared to an ego-only baseline, the proposed approach improves the macro-average of the accuracy metrics by up to 13.9%. The results also show that this performance gain increases by up to 32.7% under challenging rainy and nighttime conditions, confirming the robustness of the framework in adverse settings.
Graph-structured combinatorial challenges are inherently difficult due to their nonlinear and intricate nature, often rendering traditional computational methods ineffective or expensive. However, these challenges can be more naturally tackled by humans through visual representations that harness our innate ability for spatial reasoning. In this study, we propose transforming graphs into images to preserve their higher-order structural features accurately, revolutionizing the representation used in solving graph-structured combinatorial tasks. This approach allows machines to emulate human-like processing in addressing complex combinatorial challenges. By combining the innovative paradigm powered by multimodal large language models (MLLMs) with simple search techniques, we aim to develop a novel and effective framework for tackling such problems. Our investigation into MLLMs spanned a variety of graph-based tasks, from combinatorial problems like influence maximization to sequential decision-making in network dismantling, as well as addressing six fundamental graph-related issues. Our findings demonstrate that MLLMs exhibit exceptional spatial intelligence and a distinctive capability for handling these problems, significantly advancing the potential for machines to comprehend and analyze graph-structured data with a depth and intuition akin to human cognition. These results also imply that integrating MLLMs with simple optimization strategies could form a novel and efficient approach for navigating graph-structured combinatorial challenges without complex derivations, computationally demanding training and fine-tuning.
Tactile sensing provides local essential information that is complementary to visual perception, such as texture, compliance, and force. Despite recent advances in visuotactile representation learning, challenges remain in fusing these modalities and generalizing across tasks and environments without heavy reliance on pre-trained vision-language models. Moreover, existing methods do not study positional encodings, thereby overlooking the multi-stage spatial reasoning needed to capture fine-grained visuotactile correlations. We introduce ViTaPEs, a transformer-based architecture for learning task-agnostic visuotactile representations from paired vision and tactile inputs. Our key idea is a two-stage positional injection: local (modality-specific) positional encodings are added within each stream, and a global positional encoding is added on the joint token sequence immediately before attention, providing a shared positional vocabulary at the stage where cross-modal interaction occurs. We make the positional injection points explicit and conduct controlled ablations that isolate their effect before a token-wise nonlinearity versus immediately before self-attention. Experiments on multiple large-scale real-world datasets show that ViTaPEs not only surpasses state-of-the-art baselines across various recognition tasks but also demonstrates zero-shot generalization to unseen, out-of-domain scenarios. We further demonstrate the transfer-learning strength of ViTaPEs in a robotic grasping task, where it outperforms state-of-the-art baselines in predicting grasp success. Project page: https://sites.google.com/view/vitapes
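A minimal PyTorch sketch of the two-stage positional injection idea described above: modality-specific positional encodings added within each stream, then a shared global positional encoding added to the concatenated token sequence immediately before attention. The token counts, dimensions, and the single attention layer are illustrative assumptions rather than the ViTaPEs architecture.

```python
# Sketch of two-stage positional injection for visuotactile token fusion.
import torch
import torch.nn as nn

class TwoStagePositionalFusion(nn.Module):
    def __init__(self, dim=256, n_vis=196, n_tac=64, heads=8):
        super().__init__()
        self.pos_vis = nn.Parameter(torch.zeros(1, n_vis, dim))    # local: vision stream
        self.pos_tac = nn.Parameter(torch.zeros(1, n_tac, dim))    # local: tactile stream
        self.pos_global = nn.Parameter(torch.zeros(1, n_vis + n_tac, dim))  # shared vocabulary
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis_tokens, tac_tokens):
        vis_tokens = vis_tokens + self.pos_vis          # stage 1: modality-specific PEs
        tac_tokens = tac_tokens + self.pos_tac
        tokens = torch.cat([vis_tokens, tac_tokens], dim=1) + self.pos_global  # stage 2
        fused, _ = self.attn(tokens, tokens, tokens)    # cross-modal interaction
        return fused

out = TwoStagePositionalFusion()(torch.randn(2, 196, 256), torch.randn(2, 64, 256))
```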
Recent advances in large multimodal models suggest that explicit reasoning mechanisms play a critical role in improving model reliability, interpretability, and cross-modal alignment. While such reasoning-centric approaches have been proven effective in language and vision tasks, their extension to 3D remains underdeveloped. CoRe3D introduces a unified 3D understanding and generation reasoning framework that jointly operates over semantic and spatial abstractions, enabling high-level intent inferred from language to directly guide low-level 3D content formation. Central to this design is a spatially grounded reasoning representation that decomposes 3D latent space into localized regions, allowing the model to reason over geometry in a compositional and procedural manner. By tightly coupling semantic chain-of-thought inference with structured spatial reasoning, CoRe3D produces 3D outputs that exhibit strong local consistency and faithful alignment with linguistic descriptions.
Multimodal large language models (MLLMs) have made significant advancements in vision understanding and reasoning. However, the autoregressive Transformer architecture used by MLLMs requires tokenization on input images, which limits their ability to accurately ground objects within the 2D image space. This raises an important question: how can sequential language tokens be improved to better ground objects in 2D spatial space for MLLMs? To address this, we present a spatial representation method for grounding objects, namely GETok, that integrates a specialized vocabulary of learnable tokens into MLLMs. GETok first uses grid tokens to partition the image plane into structured spatial anchors, and then exploits offset tokens to enable precise and iterative refinement of localization predictions. By embedding spatial relationships directly into tokens, GETok significantly advances MLLMs in native 2D space reasoning without modifying the autoregressive architecture. Extensive experiments demonstrate that GETok achieves superior performance over the state-of-the-art methods across various referring tasks in both supervised fine-tuning and reinforcement learning settings.
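A toy illustration of the grid-plus-offset idea: a grid token picks a coarse spatial anchor cell and offset tokens refine the coordinate within it. The vocabulary layout and value ranges below are invented for illustration, not GETok's actual token scheme.

```python
# Toy decoding of a grid token plus offsets into continuous normalized coordinates.
def decode_grid_offset(grid_id, dx, dy, grid_size=32):
    """grid_id in [0, grid_size**2): index of the anchor cell (row-major).
    dx, dy in [-0.5, 0.5): normalized offsets within one cell.
    Returns (x, y) in [0, 1] normalized image coordinates."""
    row, col = divmod(grid_id, grid_size)
    cell = 1.0 / grid_size
    x = (col + 0.5 + dx) * cell
    y = (row + 0.5 + dy) * cell
    return x, y

# Anchor cell 530 with a small refinement toward the upper-left of the cell.
print(decode_grid_offset(530, dx=-0.2, dy=-0.1))
```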
Effective machine learning for natural hazard prediction and monitoring depends on timely access to high-quality, event-specific datasets and models capable of adapting to evolving environmental dynamics (e.g., those induced by climate change). Equally important is model explainability, which enhances trust by clarifying decision-making processes and enabling insight into observed hazard patterns. This article introduces a novel approach for the automated construction of multimodal hazard datasets tailored for supervised learning. Central to our method is an ontology-driven self-labeling pipeline that semantically annotates each data element using concepts from a modular, integrated ontology encompassing geographic, hazard, sensor, spatial, and temporal dimensions. This enriched semantic representation facilitates the rapid generation of event-specific datasets and supports reuse across hazard types. Furthermore, embedding ontological descriptors into machine learning outputs enables explainable AI through semantic reasoning, enhancing the interpretability and transparency of predictions. Our pipeline allows for dynamic dataset creation, model adaptation to newly emerging patterns, and live semantic querying over a knowledge graph. Each dataset instance encapsulates a rich semantic narrative including hazard type, evolution stage, and contextual variables such as land cover, socio-environmental indicators, and historical weather data.
The efficient management of campus infrastructure presents a complex spatiotemporal forecasting challenge characterized by dynamic interdependencies between physical assets. Traditional models fail to capture these intricate relationships as they treat buildings as independent entities or rely on static correlation structures. This paper introduces a novel Spatiotemporal Graph Neural Network (ST-GNN) framework that reframes infrastructure forecasting as a relational reasoning task, enabling dynamic inference of campus-wide interdependencies. Our approach integrates Graph Attention Networks (GAT) to learn time-varying spatial dependencies and Gated Temporal Convolutional Networks (TCNs) to capture multi-scale temporal patterns. A key innovation is our context-sensitive graph construction method that incorporates physical proximity, functional similarity, and human mobility data to create a holistic representation of campus dynamics. Evaluated on a real-world multimodal dataset comprising 24 months of energy and occupancy data from 50 campus buildings, the proposed model demonstrates superior performance, achieving a 16.3% reduction in mean absolute error compared to the strongest baseline. Comprehensive ablation studies confirm the critical contribution of each architectural component, while qualitative analysis reveals the model’s capacity to provide interpretable insights into campus operational patterns. This work provides a powerful framework for intelligent campus management, enabling precise resource allocation, energy optimization, and sustainable operational planning through advanced relational reasoning capabilities.
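A small sketch of the context-sensitive graph construction named above: an adjacency matrix mixed from physical proximity, functional similarity, and mobility co-occurrence. The weights and the row normalization are assumptions; the paper's construction may differ.

```python
# Sketch of building a mixed adjacency from proximity, functional similarity, and mobility.
import numpy as np

def build_adjacency(proximity, func_sim, mobility, weights=(0.4, 0.3, 0.3)):
    """All inputs: (N, N) matrices scaled to [0, 1] for N buildings."""
    w_p, w_f, w_m = weights
    adj = w_p * proximity + w_f * func_sim + w_m * mobility
    np.fill_diagonal(adj, 0.0)                  # no self-loops before GAT normalization
    deg = adj.sum(axis=1, keepdims=True).clip(min=1e-8)
    return adj / deg                            # row-normalized adjacency

n = 50
A = build_adjacency(np.random.rand(n, n), np.random.rand(n, n), np.random.rand(n, n))
```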
Existing multimodal document question-answering (QA) systems predominantly rely on flat semantic retrieval, representing documents as a set of disconnected text chunks and largely neglecting their intrinsic hierarchical and relational structures. Such flattening disrupts logical and spatial dependencies, such as section organization, figure-text correspondence, and cross-reference relations, that humans naturally exploit for comprehension. To address this limitation, we introduce a document-level structural Document MAP (DMAP), which explicitly encodes both hierarchical organization and inter-element relationships within multimodal documents. Specifically, we design a Structured-Semantic Understanding Agent to construct DMAP by organizing textual content together with figures, tables, charts, etc. into a human-aligned hierarchical schema that captures both semantic and layout dependencies. Building upon this representation, a Reflective Reasoning Agent performs structure-aware and evidence-driven reasoning, dynamically assessing the sufficiency of retrieved context and iteratively refining answers through targeted interactions with DMAP. Extensive experiments on MMDocQA benchmarks demonstrate that DMAP yields document-specific structural representations aligned with human interpretive patterns, substantially enhancing retrieval precision, reasoning consistency, and multimodal comprehension over conventional RAG-based approaches. Code is available at https://github.com/Forlorin/DMAP
In existing UAV swarm task planning methods, task allocation and path planning are usually solved in two separate modules. Although this reduces the complexity of problem solving, it ignores the coupling between task allocation and path optimization in actual tasks, which reduces collaborative efficiency in UAV swarms. To address the above problems, this paper proposes a novel UAV swarm decision-making system under a unified task and spatial view. In the task part, the system uses a large language model (LLM) for high-level reasoning and Chain of Thought (CoT) task decomposition in the high-level task allocation decision module. The LLM planner iteratively decomposes the task into subtasks and dynamically optimizes the reward function of the agent as conditions change. At the same time, the multimodal representation learning module based on Retrieval Augmented Generation (RAG) knowledge integration fuses heterogeneous inputs (such as vision, radar, depth map) into a compact latent state, while the RAG-based module continuously retrieves relevant past experiences from the UAV knowledge base, and the retrieved contextual information is combined with real-time sensor data to inform the current UAV decision. These components are tightly coupled with a closed-loop reinforcement learning (RL) agent for the spatial part, where real-time feedback from the LLM and RAG modules adaptively shapes the reward signals. Experiments on challenging cooperative tasks demonstrate that our system significantly outperforms traditional RL baselines in coordination efficiency and adaptability.
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
How to integrate and verify spatial intelligence in foundation models remains an open challenge. Current practice often proxies Visual-Spatial Intelligence (VSI) with purely textual prompts and VQA-style scoring, which obscures geometry, invites linguistic shortcuts, and weakens attribution to genuinely spatial skills. We introduce Spatial Intelligence Grid (SIG): a structured, grid-based schema that explicitly encodes object layouts, inter-object relations, and physically grounded priors. As a complementary channel to text, SIG provides a faithful, compositional representation of scene structure for foundation-model reasoning. Building on SIG, we derive SIG-informed evaluation metrics that quantify a model's intrinsic VSI, which separates spatial capability from language priors. In few-shot in-context learning with state-of-the-art multimodal LLMs (e.g. GPT- and Gemini-family models), SIG yields consistently larger, more stable, and more comprehensive gains across all VSI metrics compared to VQA-only representations, indicating its promise as a data-labeling and training schema for learning VSI. We also release SIGBench, a benchmark of 1.4K driving frames annotated with ground-truth SIG labels and human gaze traces, supporting both grid-based machine VSI tasks and attention-driven, human-like VSI tasks in autonomous-driving scenarios.
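To give a sense of what a grid-based scene schema could look like as a plain data structure (object layouts on a coarse grid plus explicit relations and priors), here is a small illustrative example; the field names and values are invented, and the actual SIG format may differ.

```python
# Illustrative grid-based scene schema: object layouts, relations, and priors.
# Field names are invented for illustration, not the actual SIG specification.
scene_grid = {
    "grid_size": [8, 8],                      # rows x cols over the image / BEV plane
    "objects": [
        {"id": 0, "label": "car",           "cell": [5, 2], "heading": "east"},
        {"id": 1, "label": "pedestrian",    "cell": [4, 3]},
        {"id": 2, "label": "traffic_light", "cell": [1, 4], "state": "red"},
    ],
    "relations": [
        {"subject": 1, "predicate": "in_front_of", "object": 0},
        {"subject": 0, "predicate": "approaching", "object": 2},
    ],
    "priors": ["vehicles stay in drivable cells", "pedestrians move slower than vehicles"],
}
```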
We present Omni-RGPT, a multimodal large language model designed to facilitate region-level comprehension for both images and videos. To achieve consistent region representation across spatio-temporal dimensions, we introduce Token Mark, a set of tokens highlighting the target regions within the visual feature space. These tokens are directly embedded into spatial regions using region prompts (e.g., boxes or masks) and simultaneously incorporated into the text prompt to specify the target, establishing a direct connection between visual and text tokens. To further support robust video understanding without requiring tracklets, we introduce an auxiliary task that guides Token Mark by leveraging the consistency of the tokens, enabling stable region interpretation across the video. Additionally, we introduce a large-scale region-level video instruction dataset (RegVID300k). Omni-RGPT achieves state-of-the-art results on image and video-based commonsense reasoning benchmarks while showing strong performance in captioning and referring expression comprehension tasks.
Current foundation models (FMs) rely on token representations that directly fragment continuous real-world multimodal data into discrete tokens. They limit FMs to learning real-world knowledge and relationships purely through statistical correlation rather than leveraging explicit domain knowledge. Consequently, current FMs struggle with maintaining semantic coherence across modalities, capturing fine-grained spatial-temporal dynamics, and performing causal reasoning. These limitations cannot be overcome by simply scaling up model size or expanding datasets. This position paper argues that the machine learning community should consider digital twin (DT) representations, which are outcome-driven digital representations that serve as building blocks for creating virtual replicas of physical processes, as an alternative to the token representation for building FMs. Finally, we discuss how DT representations can address these challenges by providing physically grounded representations that explicitly encode domain knowledge and preserve the continuous nature of real-world processes.
The remarkable reasoning and generalization capabilities of Large Language Models (LLMs) have paved the way for their expanding applications in embodied AI, robotics, and other real-world tasks. To effectively support these applications, grounding in spatial and temporal understanding in multimodal environments is essential. To this end, recent works have leveraged scene graphs, a structured representation that encodes entities, attributes, and their relationships in a scene. However, a comprehensive evaluation of LLMs' ability to utilize scene graphs remains limited. In this work, we introduce Text-Scene Graph (TSG) Bench, a benchmark designed to systematically assess LLMs' ability to (1) understand scene graphs and (2) generate them from textual narratives. With TSG Bench we evaluate 11 LLMs and reveal that, while models perform well on scene graph understanding, they struggle with scene graph generation, particularly for complex narratives. Our analysis indicates that these models fail to effectively decompose discrete scenes from a complex narrative, leading to a bottleneck when generating scene graphs. These findings underscore the need for improved methodologies in scene graph generation and provide valuable insights for future research. The demonstration of our benchmark is available at https://tsg-bench.netlify.app. Additionally, our code and evaluation data are publicly available at https://github.com/docworlds/tsg-bench.
In this paper, a novel benchmark for audio-visual question answering continual learning (AVQACL) is introduced, aiming to study fine-grained scene understanding and spatial-temporal reasoning in videos under a continual learning setting. To facilitate this multimodal continual learning task, we create two audio-visual question answering continual learning datasets, named Split-AVQA and Split-MUSIC-AVQA, based on the AVQA and MUSIC-AVQA datasets, respectively. The experimental results suggest that the model exhibits limited cognitive and reasoning abilities and experiences catastrophic forgetting when processing three modalities simultaneously in a continuous data stream. To address the above challenges, we propose a novel continual learning method that incorporates question-guided cross-modal information fusion (QCIF) to focus on question-relevant details for improved feature representation and task-specific knowledge distillation with spatial-temporal feature constraints (TKD-STFC) to preserve the spatial-temporal reasoning knowledge acquired from previous dynamic scenarios. Furthermore, a question semantic consistency constraint (QSCC) is employed to ensure that the model maintains a consistent understanding of question semantics across tasks throughout the continual learning process. Extensive experimental results on the Split-AVQA and Split-MUSIC-AVQA datasets illustrate that our method achieves state-of-the-art audio-visual question answering continual learning performance. The code is available at https://github.com/kx-wu/AVQACL.
Embedding a language field in a 3D representation enables richer semantic understanding of spatial environments by linking geometry with descriptive meaning. This allows for a more intuitive human-computer interaction, enabling querying or editing scenes using natural language, and could potentially improve tasks like scene retrieval, navigation, and multimodal reasoning. While such capabilities could be transformative, in particular for large-scale scenes, we find that recent feature distillation approaches cannot effectively learn over massive Internet data due to challenges in semantic feature misalignment and inefficiency in memory and runtime. To this end, we propose a novel approach to address these challenges. First, we introduce extremely low-dimensional semantic bottleneck features as part of the underlying 3D Gaussian representation. These are processed by rendering and passing them through a multi-resolution, feature-based, hash encoder. This significantly improves efficiency both in runtime and GPU memory. Second, we introduce an Attenuated Downsampler module and propose several regularizations addressing the semantic misalignment of ground truth 2D features. We evaluate our method on the in-the-wild HolyScenes dataset and demonstrate that it surpasses existing approaches in both performance and efficiency.
Multimodal Large Language Models (MLLMs) have achieved remarkable success in vision-language tasks, but their remote sensing (RS) counterparts remain relatively underexplored. Unlike natural images, RS imagery presents unique challenges that current MLLMs struggle to handle, particularly in visual grounding and spatial reasoning. This study investigates the limitations of CLIP-based MLLMs in RS, highlighting their failure to differentiate visually distinct yet semantically similar RS images. To address this, we introduce a remote sensing multimodal visual patterns (RSMMVP) benchmark. It is designed to evaluate MLLMs in RS tasks by identifying the CLIP-blind pairs, where CLIP-based models incorrectly assign high similarity scores to visually distinct RS images. Through a visual question answering (VQA) evaluation, we analyze the performance of state-of-the-art MLLMs, revealing significant limitations in RS specific representation learning. The results provide valuable insights into the weaknesses of CLIP-based visual encoding and offer a foundation for future research to develop more effective MLLMs tailored for remote sensing applications.
Multimodal 3D grounding has garnered considerable interest in Vision-Language Models (VLMs) \cite{yin2025spatial} for advancing spatial reasoning in complex environments. However, these models suffer from a severe "2D semantic bias" that arises from over-reliance on 2D image features for coarse localization, largely disregarding 3D geometric inputs and resulting in suboptimal fusion performance. In this paper, we propose a novel training framework called What-Where Representation Re-Forming (W2R2) to tackle this issue via disentangled representation learning and targeted shortcut suppression. Our approach fundamentally reshapes the model's internal space by designating 2D features as semantic beacons for "What" identification and 3D features as spatial anchors for "Where" localization, enabling precise 3D grounding without modifying inference architecture. Key components include a dual-objective loss function with an Alignment Loss that supervises fused predictions using adapted cross-entropy for multimodal synergy, and a Pseudo-Label Loss that penalizes overly effective 2D-dominant pseudo-outputs via a margin-based mechanism. Experiments conducted on ScanRefer and ScanQA demonstrate the effectiveness of W2R2, with significant gains in localization accuracy and robustness, particularly in cluttered outdoor scenes.
The rapid development of Vision-Language models (VLMs) and Multimodal Language Models (MLLMs) in autonomous driving research has significantly reshaped the landscape by enabling richer scene understanding, context-aware reasoning, and more interpretable decision-making. However, much existing work relies on either single-view encoders that fail to exploit the spatial structure of multi-camera systems or operates on aggregated multi-view features, which lack a unified spatial representation, making it more challenging to reason about ego-centric directions, object relations, and the wider context. We thus present BeLLA, an end-to-end architecture that connects unified 360° BEV representations with a large language model for question answering in autonomous driving. We primarily evaluate our work using two benchmarks - NuScenes-QA and DriveLM, where BeLLA consistently outperforms existing approaches on questions that require greater spatial reasoning, such as those involving relative object positioning and behavioral understanding of nearby objects, achieving up to +9.3% absolute improvement in certain tasks. In other categories, BeLLA performs competitively, demonstrating the capability of handling a diverse range of questions.
To address the issues of robots lacking spatial semantics and having inaccurate instruction parsing during task planning in open environments, this paper proposes a two-stage composite task planner called DQTP (Dual-Qwen Tandem Planner). This planner adopts two Qwen2-VL multimodal large language models working in tandem: the first model extracts the spatial relationships of task-related objects from scene images or videos through a standardized prompt template and outputs them in the form of natural language prompts as spatial semantic priors; the second model takes composite task instructions, real-time scene videos, and spatial prompts as multimodal inputs to complete task decomposition and generate executable action sequences. To enhance the capabilities of spatial relationship representation and task reasoning, this paper designs a lightweight fine-tuning strategy and constructs a standardized prompt template to obtain high-quality training samples. Experimental results show that DQTP improves the completeness of spatial relationship extraction by 7.6% compared with the single model. In typical home task planning, both visual consistency and physical feasibility are increased by about 12%, and the execution success rate reaches 69.1%. As a result of closed-loop feedback, the execution success rate reaches 87.3%.
Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is not enough to capture the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model’s capacity for generalization and higher-level reasoning. In this paper, we propose a fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing the fundamental understanding of videos. FIQ generates Q&A pairs based on descriptions extracted from videos, enriching the training data with fundamental scene information. Generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that assists task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved to increase the adaptability of downstream tasks. Experiments on SUTD-TrafficQA demonstrate that our FIQ achieves state-of-the-art performance compared to existing baseline methods. Code is available at https://github.com/juyoungohjulie/FIQ
The ability to process information from multiple modalities and to reason through it step-by-step remains a critical challenge in advancing artificial intelligence. However, existing reasoning benchmarks focus on text-only reasoning, or employ multimodal questions that can be answered by directly retrieving information from a non-text modality. Thus, complex reasoning remains poorly understood in multimodal domains. Here, we present MARBLE, a challenging multimodal reasoning benchmark that is designed to scrutinize multimodal language models (MLLMs) in their ability to carefully reason step-by-step through complex multimodal problems and environments. MARBLE is composed of two highly challenging tasks, M-Portal and M-Cube, that require the crafting and understanding of multistep plans under spatial, visual, and physical constraints. We find that current MLLMs perform poorly on MARBLE -- all the 12 advanced models obtain near-random performance on M-Portal and 0% accuracy on M-Cube. Only in simplified subtasks do some models outperform the random baseline, indicating that complex reasoning is still a challenge for existing MLLMs. Moreover, we show that perception remains a bottleneck, where MLLMs occasionally fail to extract information from the visual inputs. By shedding light on the limitations of MLLMs, we hope that MARBLE will spur the development of the next generation of models with the ability to reason and plan across many, multimodal reasoning steps.
While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. Project is open at https://github.com/Cratileo/D2R.
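A small sketch of the "visual draft" idea in its simplest form: overlaying a candidate path on a maze image before handing it back to the model. The grid-to-pixel mapping, colors, and path format are illustrative assumptions, not the D2R pipeline.

```python
# Sketch: overlay a candidate path ("visual draft") on a maze image with PIL.
from PIL import Image, ImageDraw

def overlay_draft(image, path_cells, cell_px=32, color=(255, 0, 0), width=4):
    """image: PIL.Image of the maze. path_cells: list of (row, col) grid cells to visit."""
    canvas = image.copy()
    draw = ImageDraw.Draw(canvas)
    points = [(c * cell_px + cell_px // 2, r * cell_px + cell_px // 2) for r, c in path_cells]
    draw.line(points, fill=color, width=width)
    return canvas

maze = Image.new("RGB", (8 * 32, 8 * 32), "white")
draft = overlay_draft(maze, [(0, 0), (0, 3), (4, 3), (4, 7)])
```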
Although large multimodal models (LMMs) have demonstrated remarkable capabilities in visual scene interpretation and reasoning, their capacity for complex and precise 3-dimensional spatial reasoning remains uncertain. Existing benchmarks focus predominantly on 2D spatial understanding and lack a framework to comprehensively evaluate 6D spatial reasoning across varying complexities. To address this limitation, we present Spatial457, a scalable and unbiased synthetic dataset designed around 4 key capabilities for spatial reasoning: multi-object recognition, 2D location, 3D location, and 3D orientation. We develop a cascading evaluation structure, constructing 7 question types across 5 difficulty levels that range from basic single object recognition to our new proposed complex 6D spatial reasoning tasks. We evaluated various large multimodal models (LMMs) on Spatial457, observing a general decline in performance as task complexity increases, particularly in 3D reasoning and 6D spatial tasks. To quantify these challenges, we introduce the Relative Performance Dropping Rate (RPDR), highlighting key weaknesses in 3D reasoning capabilities. Leveraging the unbiased attribute design of our dataset, we also uncover prediction biases across different attributes, with similar patterns observed in real-world image settings. The code is released at https://github.com/XingruiWang/Spatial457.
No abstract available
The 180x360 omnidirectional field of view captured by 360-degree cameras enables their use in a wide range of applications such as embodied AI and virtual reality. Although recent advances in multimodal large language models (MLLMs) have shown promise in visual-spatial reasoning, most studies focus on standard pinhole-view images, leaving omnidirectional perception largely unexplored. In this paper, we ask: Are MLLMs ready for omnidirectional spatial reasoning? To investigate this, we introduce OSR-Bench, the first benchmark specifically designed for this setting. OSR-Bench includes over 153,000 diverse question-answer pairs grounded in high-fidelity panoramic indoor scene maps. It covers key reasoning types including object counting, relative distance, and direction. We also propose a negative sampling strategy that inserts non-existent objects into prompts to evaluate hallucination and grounding robustness. For fine-grained analysis, we design a two-stage evaluation framework assessing both cognitive map generation and QA accuracy using rotation-invariant matching and a combination of rule-based and LLM-based metrics. We evaluate eight state-of-the-art MLLMs, including GPT-4o, Gemini 1.5 Pro, and leading open-source models under zero-shot settings. Results show that current models struggle with spatial reasoning in panoramic contexts, highlighting the need for more perceptually grounded MLLMs. OSR-Bench and code will be released at: https://huggingface.co/datasets/UUUserna/OSR-Bench
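The two-stage evaluation described above relies on rotation-invariant matching of cognitive maps; the exact procedure is not given in the abstract, so the following sketch assumes the maps are discretized object-label grids and scores agreement under the four 90-degree rotations (an illustrative simplification, not OSR-Bench's metric).

```python
import numpy as np

def rotation_invariant_match(pred_map: np.ndarray, gt_map: np.ndarray) -> float:
    """Score two H x W cognitive maps (integer object-class labels, 0 = empty)
    by cell-wise agreement, taking the best of the four 90-degree rotations so
    that an arbitrary choice of panorama starting direction is not penalized.
    Illustrative simplification, not OSR-Bench's exact metric."""
    scores = []
    for k in range(4):
        rotated = np.rot90(pred_map, k)
        scores.append(float((rotated == gt_map).mean()))
    return max(scores)

pred = np.array([[0, 1], [2, 0]])
gt = np.array([[2, 0], [0, 1]])
print(rotation_invariant_match(pred, gt))  # 1.0 after a 90-degree clockwise rotation
```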
Geometry is a fundamental branch of mathematics and plays a crucial role in evaluating the reasoning capabilities of multimodal large language models (MLLMs). However, existing multimodal mathematics benchmarks mainly focus on plane geometry and largely ignore solid geometry, which requires spatial reasoning and is more challenging than plane geometry. To address this critical gap, we introduce SolidGeo, the first large-scale benchmark specifically designed to evaluate the performance of MLLMs on mathematical reasoning tasks in solid geometry. SolidGeo consists of 3,113 real-world K-12 and competition-level problems, each paired with visual context and annotated with difficulty levels and fine-grained solid geometry categories. Our benchmark covers a wide range of 3D reasoning subjects such as projection, unfolding, spatial measurement, and spatial vector, offering a rigorous testbed for assessing solid geometry. Through extensive experiments, we observe that MLLMs encounter substantial challenges in solid geometry math tasks, with a considerable performance gap relative to human capabilities on SolidGeo. Moreover, we analyze the performance, inference efficiency and error patterns of various models, offering insights into the solid geometric mathematical reasoning capabilities of MLLMs. We hope SolidGeo serves as a catalyst for advancing MLLMs toward deeper geometric reasoning and spatial intelligence.
No abstract available
Multimodal remote sensing object detection (MM-RSOD) holds great promise for around-the-clock applications. However, it faces challenges in effectively extracting complementary features due to the modality inconsistency and redundancy. Inconsistency can lead to semantic-spatial misalignment, while redundancy introduces uncertainty that is specific to each modality. To overcome these challenges and enhance complementarity exploration and exploitation, this article proposes a dual-dynamic cross-modal interaction network (DDCINet), a novel framework comprising two key modules: a dual-dynamic cross-modal interaction (DDCI) module and a dynamic feature fusion (DFF) module. The DDCI module simultaneously addresses both modality inconsistency and redundancy by employing a collaborative design of channel-gated spatial cross-attention (CSCA) and cross-modal dynamic filters (CMDFs) on evenly segmented multimodal features. The CSCA component enhances the semantic-spatial correlation between modalities by identifying the most relevant channel-spatial features through cross-attention, addressing modality inconsistency. In parallel, the CMDF component achieves cross-modal context interaction through static convolution and further generates dynamic spatial-variant kernels to filter out irrelevant information between modalities, addressing modality redundancy. Following the improved feature extraction, the DFF module dynamically adjusts interchannel dependencies guided by modal-specific global context to fuse features, achieving better complementarity exploitation. Extensive experiments conducted on three MM-RSOD datasets confirm the superiority and generalizability of the DDCINet framework. Notably, our DDCINet, based on the RoI Transformer benchmark and ResNet50 backbone, achieves 78.4% mAP50 on the DroneVehicle test set and outperforms state-of-the-art (SOTA) methods by large margins.
Multi-organ medical segmentation is a crucial component of medical image processing, essential for doctors to make accurate diagnoses and develop effective treatment plans. Despite significant progress in this field, current multi-organ segmentation models often suffer from inaccurate details, dependence on geometric prompts, and loss of spatial information. Addressing these challenges, we introduce a novel model named CRISP-SAM2, with CRoss-modal Interaction and Semantic Prompting based on SAM2. This model represents a promising approach to multi-organ medical segmentation guided by textual descriptions of organs. Our method begins by converting visual and textual inputs into cross-modal contextualized semantics using a progressive cross-attention interaction mechanism. These semantics are then injected into the image encoder to enhance the detailed understanding of visual information. To eliminate reliance on geometric prompts, we use a semantic prompting strategy, replacing the original prompt encoder to sharpen the perception of challenging targets. In addition, a similarity-sorting self-updating strategy for memory and a mask-refining process are applied to further adapt to medical imaging and enhance localized details. Comparative experiments conducted on seven public datasets indicate that CRISP-SAM2 outperforms existing models. Extensive analysis also demonstrates the effectiveness of our method, thereby confirming its superior performance, especially in addressing the limitations mentioned earlier. Our code is available at: https://github.com/YU-deep/CRISP_SAM2.git.
Hyperspectral–multispectral image fusion (HMIF) aims to achieve hyperspectral image (HSI) super-resolution by integrating the rich spectral information of HSI with the high spatial resolution of multispectral images (MSIs). Despite remarkable progress enabled by deep learning, HMIF remains challenging. Conventional fusion networks that rely solely on feature concatenation often fail to leverage the abundant prior knowledge inherent in remote sensing data, thereby limiting their ability to simulate the complex nonlinear relationships found in real-world scenes. Moreover, introducing shallow cross-modal feature sharing frequently results in edge artifacts or spectral distortions, while adopting decoupled branches hinders propagating complementary information across modalities. To address these limitations, we propose the spatial–spectral cross-modal alternating direction method of multipliers (ADMM) unfolding network (SCIAU-Net), an explainable deep learning framework that unfolds the optimization process of the ADMM. SCIAU-Net reformulates two degradation models, dominated by HSI and MSI respectively, into a dual-branch neural architecture with dedicated modules designed to solve the corresponding variables. To begin with, dense VRWKV blocks (DVBs) replace handcrafted components, embedding domain knowledge and physical priors of remote sensing images directly into the network. Moreover, we introduce spatial–spectral cross-modal interaction modules. In the HSI-dominated branch, SpeCIM injects MSI-guided spatial cues via adaptive implicit neural representation to extract spatial details, while in the MSI-dominated branch, SpaCIM employs state space duality to model intergroup spectral dependencies and refine spectral reconstruction. Finally, a principled loss function, comprising a mean squared error term and a Karush–Kuhn–Tucker consistency term, penalizes the ADMM primal and dual residuals, promoting convergence toward physically consistent solutions. Extensive qualitative and quantitative experiments on five datasets demonstrate that SCIAU-Net achieves state-of-the-art performance in all evaluated scenarios, producing high-resolution HSI with superior spatial and spectral fidelity.
Substantial advancements have been made in the field of salient object detection in image processing in recent years. This research introduces the three-decoder cross-modal interaction network (TCINet) for salient object detection in unregistered red-green-blue (RGB)–thermal image pairs, modeling information from different modal perspectives. TCINet employs a three-decoder framework to process RGB, thermal, and fused feature maps concurrently. To ensure robust integration between the modalities, mitigating the impact of unregistered images and addressing modality imbalances, we introduce the fusion complementary registration (FCR) module. This module guides attention to connect the two modalities and uses atrous spatial pyramid pooling (ASPP) to adapt to image scale changes. To fully utilize the differences between modalities, we designed two distinct decoders: fusion feature decoder (FFD) for decoding the fused features and single-modal decoder (SMD) for decoding single-modal features. Additionally, we incorporated feature enhancement (FE) units into the modal decoding to mitigate the blurring effect caused by high-speed autonomous aerial vehicle (AAV) flight. We use a weighted fusion module (WFM) to dynamically integrate the features decoded by the three decoders to increase the network’s generalization ability. Extensive experiments show that TCINet outperforms existing methods, achieving excellent results on a variety of challenging scenarios containing complex details. The code will be published at https://github.com/zqiuqiu235/TCINet.git.
3D object detection is crucial for autonomous driving, enabling accurate object classification and localization in the real world. Existing methods typically rely on basic element-wise operations to fuse multi-modal features from point clouds and images, limiting the effective learning of camera semantics and LiDAR spatial information. Additionally, the inherent sparsity of point clouds leads to distribution imbalances in receptive fields, and the complexity of 3D objects conceals implicit relational contexts. To address these limitations, we propose CIDRA-Net, a cross-modal interaction fusion network with distribution-relation awareness. First, we introduce a region cross-modal interaction fusion (RCIF) module that combines LiDAR features with camera depth information through dual-modal attention. We then separate and enhance two distribution-level features using a dual-branch distribution perception (DBDP) module to learn point distributions. Additionally, a global-local relation mining (GLRM) strategy is employed to capture both local and global contextual information for better object understanding and refined regression tasks. Our approach achieves state-of-the-art performance on the nuScenes and KITTI benchmarks while demonstrating strong generalization across backbones and robustness against sensor errors.
3D object detection plays a critical role in autonomous driving perception systems. While existing multimodal approaches typically employ independent feature processing streams followed by direct Bird’s Eye View projection for modality fusion, they encounter three critical limitations: insufficient cross-modal complementarity, feature misalignment across modalities, and inefficient computational workflows. To address these challenges, this paper proposes DCI-PRNet, a dual cross-modal interaction and reasoning framework that establishes deep synergistic relationships between 3D LiDAR point clouds and 2D multi-view images. The core innovations of DCI-PRNet lie in its dual cross-modal interaction module and multi-level progressive reasoning module. The dual cross-modal interaction module enables iterative feature refinement through alternating attention mechanisms and residual feature updating, effectively aligning spatial-semantic representations between point clouds and images. The multi-level progressive reasoning module implements detection refinement through cascaded decoder layers, where each stage progressively enhances detection confidence and localization precision via cross-modal aggregation. Experiments on the nuScenes dataset demonstrate significant performance improvements over conventional methods, achieving 71.7% mAP and 74.2% NDS.
This study introduces a novel cross-modal spatial-spectral interaction Mamba (CMS2I-Mamba) for remote sensing image fusion classification. Unlike convolution-based models that focus on local details and Transformer-based models with high computational complexity, CMS2I-Mamba efficiently models global long-range dependencies with linear complexity. First, multispectral (MS) and panchromatic (PAN) images each have unique advantages in spectral and spatial attributes. Given this, this article designs the multipath selective-scan mechanism (MPS2M), which applies different path scanning strategies to deeply capture global features from both the spectral and spatial dimensions, aiming to enhance the robustness and complementarity of spatial-spectral features. Second, to overcome the characterization differences between images acquired by different sensors, this article further introduces the channel interaction alignment module (CIAM). This module employs efficient former-last and odd-even channel interaction strategies to achieve precise semantic alignment of deep features between modalities. Finally, to leverage the shared fusion features to guide the unique singular features, this article proposes a semantic-aware calibration module (SACM), which accurately constrains and calibrates the same semantic information in deep features. This not only enhances the model's ability to understand scene semantics but also promotes the deep fusion and utilization of information between different modalities. Through experimental verification on multiple datasets, the proposed CMS2I-Mamba shows excellent recognition performance and computational efficiency (parameter count and running speed) in fusion classification tasks. The code for CMS2I-Mamba is available at: https://github.com/ru-willow/CMSI-Mamba.
Visual grounding relies on reasoning between visual and language modalities. Existing multimodal interaction methods struggle to handle complex cross-modal relationships and perform poorly in dynamic scenes. Most paradigms are constrained by single spatial-domain attention, making it difficult to capture global context, long-range dependencies, and balance local and global features. To address these challenges, we propose the Harmonized Spectrum-Gaussian Adaptive Attention Mechanism (HSGAM), a novel mechanism that combines frequency-domain and Gaussian adaptive modulation. HSGAM transforms visual and language features into the frequency domain, overcoming the limitations of spatial-domain self-attention in capturing long-distance dependencies. It also introduces Gaussian adaptive modulation to dynamically adjust feature interactions based on the characteristics of the visual and language modalities. Additionally, we propose the Refinative Discriminative Frequency Network, a feedforward network incorporating enhancement-mitigation and gating mechanisms. Extensive experiments on five benchmark visual grounding tasks illustrate the superiority of our network.
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle with achieving precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT that hinder this alignment: 1) the suppression of cross-modal attention due to token imbalance between visual and textual modalities, and 2) the lack of timestep-aware attention weighting. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We tested TACA on state-of-the-art models like FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention in improving semantic fidelity in text-to-image diffusion models. Our codes are publicly available at https://github.com/Vchitect/TACA.
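TACA's core idea, temperature scaling of cross-modal attention with a timestep-dependent adjustment, can be illustrated with a short PyTorch sketch; the linear temperature schedule and tensor names below are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temperature_scaled_cross_attention(q_img, k_txt, v_txt, t, t_max,
                                        tau_min=0.7, tau_max=1.3):
    """Cross-attention from image queries to text keys/values whose softmax
    temperature depends on the diffusion timestep t. A temperature below 1
    sharpens attention onto the (few) text tokens; the linear schedule is an
    illustrative assumption. Shapes: q_img (B, Nq, d), k_txt/v_txt (B, Nk, d)."""
    d = q_img.size(-1)
    tau = tau_min + (tau_max - tau_min) * (t / t_max)       # early steps get a lower temperature
    scores = q_img @ k_txt.transpose(-2, -1) / (d ** 0.5)    # (B, Nq, Nk)
    attn = F.softmax(scores / tau, dim=-1)
    return attn @ v_txt

out = temperature_scaled_cross_attention(
    torch.randn(2, 64, 32), torch.randn(2, 8, 32), torch.randn(2, 8, 32), t=100, t_max=1000)
print(out.shape)  # torch.Size([2, 64, 32])
```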
Heterogeneous change detection is a task of considerable practical importance and significant challenge in remote sensing. It involves identifying change areas using remote sensing images obtained from different sensors or under different imaging conditions. Recently, research has focused on feature-space translation methods based on deep learning for heterogeneous images. However, such methods often lose original image information, and the translated features cannot be efficiently compared, further limiting the accuracy of change detection. To address these issues, we propose a cross-modal feature interaction network (CMFINet). Specifically, CMFINet introduces a cross-modal interaction module (CMIM), which facilitates interaction between heterogeneous features through attention exchange. This approach promotes consistent representation of heterogeneous features while preserving image characteristics. Additionally, we design a differential feature extraction module (DFEM) to enhance the extraction of true change features from the spatial and channel dimensions, facilitating efficient comparison after feature interaction. Extensive experiments conducted on the California, Toulouse, and Wuhan datasets demonstrate that CMFINet outperforms eight existing methods in identifying change areas in different scenes from multimodal images. Compared to the existing methods applied to the three datasets, CMFINet achieved the highest F1 scores of 83.93%, 75.65%, and 95.42%, and the highest mIoU values of 85.38%, 78.34%, and 94.87%, respectively. These results demonstrate the effectiveness and applicability of CMFINet in heterogeneous change detection.
Hyperspectral image (HSI) and light detection and ranging (LiDAR) data joint classification is a challenging task. Existing multisource remote sensing data classification methods often rely on human-designed frameworks for feature extraction, which heavily depend on expert knowledge. To address these limitations, we propose a novel dynamic cross-modal feature interaction network (DCMNet), the first framework leveraging a dynamic routing mechanism for HSI and LiDAR classification. Specifically, our approach introduces three feature interaction blocks: bilinear spatial attention block (BSAB), bilinear channel attention block (BCAB), and integration convolutional block (ICB). These blocks are designed to effectively enhance spatial, spectral, and discriminative feature interactions. A multilayer routing space with routing gates is designed to determine optimal computational paths, enabling data-dependent feature fusion. Additionally, bilinear attention mechanisms are employed to enhance feature interactions in spatial and channel representations. Extensive experiments on three public HSI and LiDAR datasets demonstrate the superiority of DCMNet over the state-of-the-art methods. Our codes are available at https://github.com/oucailab/DCMNet.
Video Question Answering (VideoQA) requires models to comprehend video content and generate answers to natural language questions. VideoQA must reason over both spatial and temporal dimensions, presenting unique challenges as questions require varying degrees of spatial and temporal visual information. This paper proposes a Cross-Modal Spatio-Temporal Interaction Network that adaptively performs spatial and temporal interactions between video and text modalities based on question intent, without requiring additional annotations. Our approach integrates feature representation, intra-modal perception, cross-modal spatio-temporal interaction, and answer generation. The model extracts video and question features, introduces learnable tokens for global semantics and spatio-temporal intent, and employs attention mechanisms to adaptively fuse spatial and temporal information. Experiments on the MSVD-QA and MSRVTT-QA datasets demonstrate that our method achieves competitive performance, achieving 48.4% accuracy on MSVD-QA and outperforming the second-best method by 2.4%. Ablation studies verify the effectiveness of each proposed module, with visualizations confirming the model's ability to adaptively focus on spatial or temporal information based on question intent.
In cross-modal medical image segmentation, the dependence between spatial features and frequency features is easily ignored, and fine-grained frequency features are not fused effectively. To address these problems, this paper proposes a cross-modal segmentation network, DBW-Net. The main innovations are as follows. First, a cross-modal dual-domain bi-directional feature interaction segmentation network, DBW-Net, is designed with three encoders and one decoder; the three encoders extract features from PET/CT, PET, and CT, respectively. Second, a cross-modal feature extractor "from frequency to spatial" (CMFE (F->S)) is designed in the encoder. This module converts the spatial map into multiple spectral maps via the 2D Discrete Cosine Transform (2D DCT), and multi-frequency cross-dimension attention captures the correlations among the spectral-map features across dimensions to generate a refined frequency attention map. The module uses this refined frequency attention map to enhance modal features, fuse cross-modal interactions, and recalibrate the input feature map. Third, a cross-modal feature coupler "from spatial to frequency" (CMFC (S->F)) is designed in the bottleneck layer. This module maps multimodal information to the spatial and frequency domains through a spatial-frequency feature extractor, and cross-domain coupled attention bridges the semantic gap between multimodal fine-grained frequency features and spatial features. Finally, to verify the effectiveness of the proposed method, experiments are carried out on a clinical multimodal lung-tumor medical image dataset and the BraTS2019 brain-tumor public dataset. For lung tumor segmentation, mIoU, Dice, VOE, RVD, and Recall improve by 3.02%, 2.32%, 4.66%, 2.63%, and 4.16%, respectively; for brain tumor segmentation, they improve by 3.06%, 2.31%, 4.68%, 2.64%, and 5.76%, respectively. These results show that the model segments lesions with complex shapes accurately and with relatively low redundancy, significantly improving the segmentation accuracy and robustness of lesion areas and providing technical support for the accurate identification and diagnosis of early lesions.
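The CMFE (F->S) module described above turns a spatial feature map into spectral components with a 2D DCT and derives an attention map from them; the following PyTorch sketch shows that general pattern with a handful of DCT bases and a small MLP, where the chosen frequency pairs and attention head are assumptions rather than the paper's exact design.

```python
import math
import torch
import torch.nn as nn

def dct_filter(u, v, height, width):
    """2D DCT-II basis function for frequency pair (u, v), shape (H, W)."""
    i = torch.arange(height).float()
    j = torch.arange(width).float()
    row = torch.cos(math.pi * (i + 0.5) * u / height)   # (H,)
    col = torch.cos(math.pi * (j + 0.5) * v / width)     # (W,)
    return row[:, None] * col[None, :]

class MultiFrequencyChannelAttention(nn.Module):
    """Illustrative sketch: project a (B, C, H, W) feature map onto a few 2D DCT
    bases, then map the per-channel spectral responses to channel attention
    weights. Frequency pairs and MLP are assumptions, not the exact CMFE design."""
    def __init__(self, channels, height, width, freq_pairs=((0, 0), (0, 1), (1, 0))):
        super().__init__()
        filters = torch.stack([dct_filter(u, v, height, width) for u, v in freq_pairs])
        self.register_buffer("filters", filters)                 # (F, H, W)
        self.mlp = nn.Sequential(
            nn.Linear(channels * len(freq_pairs), channels), nn.ReLU(),
            nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):
        b, c = x.shape[:2]
        spectra = torch.einsum("bchw,fhw->bcf", x, self.filters)  # per-channel DCT responses
        weights = self.mlp(spectra.reshape(b, -1)).view(b, c, 1, 1)
        return x * weights

attn = MultiFrequencyChannelAttention(channels=16, height=8, width=8)
print(attn(torch.randn(2, 16, 8, 8)).shape)  # torch.Size([2, 16, 8, 8])
```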
For the problem of ship re-identification, traditional methods struggle to achieve accurate and interpretable recognition. To address this issue, this paper introduces the prototype learning paradigm—characterized by both interpretability and robustness—into the ship re-identification task, and proposes a Prototype-based Cross-Modal Network. The network acquires modality-invariant features through a multimodal feature extraction module, uses a dynamic prototype management mechanism to construct and update class-specific prototypes for capturing discriminative local features, and simultaneously integrates channel and spatial attention mechanisms to enhance the discriminative power of feature representations. We validated this method on the CMShipReID dataset. Experimental results show that Prototype-based Cross-Modal Network achieves high accuracy, and the model’s decision-making process can be intuitively presented through prototype visualization, effectively solving the "black-box" problem of traditional methods. This study verifies the effectiveness of prototype learning in multimodal ship recognition and provides important references for research on explainable artificial intelligence in fields such as multimodal object retrieval.
Constructing comprehensive multimodal feature representations from RGB images (RGB) and point clouds (PT) in 2D–3D multimodal anomaly detection (MAD) methods is very important for revealing various types of industrial anomalies. For multimodal representations, most existing MAD methods consider only the explicit spatial correspondence between the modality-specific features extracted from RGB and PT through space-aligned fusion, while overlooking the implicit interaction relationships between them. In this study, we propose a uni-modal and cross-modal fusion (UCF) method, which comprehensively incorporates the implicit relationships within and between modalities into multimodal representations. Specifically, UCF first establishes uni-modal and cross-modal embeddings to capture intramodal and intermodal relationships through uni-modal reconstruction and cross-modal mapping. Then, an adaptive nonequal fusion method is proposed to develop fusion embeddings, with the aim of preserving the primary features and reducing interference of the uni-modal and cross-modal embeddings. Finally, the uni-modal, cross-modal, and fusion embeddings all collaborate to reveal anomalies existing in different modalities. Experiments conducted on the MVTec 3D-AD benchmark and a real-world surface mount inspection demonstrate that the proposed UCF outperforms existing approaches, particularly in precise anomaly localization.
This paper presents an innovative approach to anime recommendation systems by integrating multi-modal deep learning with explainable AI techniques. We propose a novel framework that combines visual features, textual content, and user interaction data to create more accurate and interpretable recommendations. Our system addresses key challenges in existing recommendation systems, including the cold-start problem and limited content understanding, through a hybrid architecture that leverages BERT-based natural language processing and convolutional neural networks for visual analysis. Experimental results demonstrate a 27% improvement in recommendation accuracy compared to traditional methods, while providing transparent explanations for recommendations through attention visualization.
Few-shot fine-grained image classification faces significant challenges due to subtle inter-class distinctions and limited annotated samples, where conventional methods often struggle to comprehensively exploit multi-granularity semantic cues under single-scale feature fusion or unimodal representation constraints. To address this, this paper proposes a Multi-Scale Cross-Modal Collaborative Reconstruction Network (MSCMCRN), which synergistically integrates hierarchical feature aggregation, cross-modal interaction, and contrastive-guided optimization. Our framework first introduces a pyramid feature adaptation module that dynamically fuses multi-scale representations through channel-wise and spatial self-attention mechanisms, enabling joint modeling of local discriminative patterns and global structural coherence. A bidirectional cross-modal attention mechanism is then designed to explicitly capture interdependencies between channel-specific attributes and spatial-aware contours, effectively enhancing feature discriminability through mutual reinforcement. Furthermore, this paper proposes a collaborative optimization paradigm that unifies bidirectional feature reconstruction consistency with contrastive metric learning, simultaneously ensuring intra-class compactness and inter-class separability in the embedding space. Extensive evaluations on three challenging fine-grained benchmarks (CUB-200-2011, Stanford Cars, NABirds) demonstrate that MSCMCRN consistently surpasses state-of-the-art approaches in classification tasks. The results underscore the effectiveness of hierarchical multi-modal fusion in mitigating information underutilization and the critical role of contrastive constraints in alleviating few-shot overfitting, providing new insights for open-environment fine-grained recognition scenarios.
Point cloud completion is essential for robotic perception and object reconstruction, and it supports downstream tasks such as grasp planning, obstacle avoidance, and manipulation. However, incomplete geometry caused by self-occlusion and sensor limitations can significantly degrade downstream reasoning and interaction. To address these challenges, we propose HGACNet, a novel framework that reconstructs complete point clouds of individual objects by hierarchically encoding 3D geometric features and fusing them with image-guided priors from a single-view RGB image. At the core of our approach, the Hierarchical Graph Attention (HGA) encoder adaptively selects critical local points through graph attention-based downsampling and progressively refines hierarchical geometric features to better capture structural continuity and spatial relationships. To strengthen cross-modal interaction, we further design a Multi-Scale Cross-Modal Fusion (MSCF) module that performs attention-based feature alignment between hierarchical geometric features and structured visual representations, enabling fine-grained semantic guidance for completion. In addition, we propose a contrastive loss (C-Loss) to explicitly align the feature distributions across modalities, improving completion fidelity under modality discrepancy. Finally, extensive experiments conducted on both the ShapeNet-ViPC benchmark and the YCB-Complete dataset confirm the effectiveness of HGACNet, demonstrating state-of-the-art performance as well as strong applicability in real-world robotic manipulation tasks.
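The C-Loss mentioned above aligns feature distributions across modalities; a common way to do this is a symmetric InfoNCE objective between pooled geometric and visual features, sketched below in PyTorch (a generic alignment loss, not necessarily HGACNet's exact formulation).

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(geo_feat, img_feat, temperature=0.07):
    """Symmetric InfoNCE over a batch: the i-th point-cloud feature should match
    the i-th image feature and repel all other pairings. Generic cross-modal
    alignment sketch; shapes: geo_feat, img_feat (B, d)."""
    geo = F.normalize(geo_feat, dim=-1)
    img = F.normalize(img_feat, dim=-1)
    logits = geo @ img.t() / temperature                  # (B, B) cosine similarities
    targets = torch.arange(geo.size(0), device=geo.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = cross_modal_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
print(loss.item())
```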
Mining area scene classification is crucial for deposit evaluation and environmental monitoring. However, existing methods struggle with the homogeneous and heterogeneous spectral, spatial, and topographic features of mining areas, large intra-class variations, and small target sizes. To overcome these limitations, this study integrates RGB and SAR data to construct a multi-modal dataset and proposes an RGB-SAR mining scene classification model with dual feature enhancement and adaptive cross-modal attention interaction. The model includes: (1) a dual feature enhancement module that suppresses irrelevant features and enhances discriminative multi-scale representations of mining targets; (2) a BifocalNet-based feature extraction module using a CNN-Transformer hybrid architecture to capture local textures and model global context; and (3) an attention-based adaptive cross-modal interaction module that achieves deep spectral-geometric feature complementarity through the fusion of RGB and SAR modalities. Experiments show the model achieves an OA of 84.58%, outperforming other models and ranking first or second in most evaluation metrics. The proposed dataset and model thus advance mining scene classification.
To address the severe video quality degradation caused by high-concentration coal dust in confined underground coal mine spaces, which makes behavior detection and discriminative feature learning difficult, this study proposes an improved CRR-YOLO algorithm based on YOLOv11n. To tackle the challenge of learning discriminative features, a cross-modal scene-object matching module, CM-SOM, is designed: by introducing a Vision-Language Model (VLM), it establishes cross-modal interaction between visual and linguistic modalities, enhancing the feature-space distinction between targets and backgrounds and thereby improving the semantic discrimination capability of the detection model in scenarios lacking discriminative features. In the backbone network, a context prior-guided feature extraction network, RepVIT, is embedded; it constructs a dynamic contextual information flow through gated dynamic spatial aggregation, achieving dual guidance of features and weights and strengthening the model's global semantic understanding and contextual dependency modeling of the scene. Furthermore, a feature fusion network with a recalibration mechanism, Re-FPN, is designed: through a selective boundary aggregation module and a lightweight feature enhancement module, it enables complementary enhancement of boundary details and high-level semantic information via a bidirectional interaction mechanism, optimizing multi-scale feature fusion. Experiments on the dedicated underground coal mine behavior dataset DsLMF+ demonstrate that CRR-YOLO achieves 84.3% mAP@0.5 and a 79.1% F1-score, outperforming several advanced models. With only 2.4M parameters and 6.2 GFLOPs, it achieves an inference speed of 253 FPS, striking a favorable balance among accuracy, speed, and complexity, and exhibits strong potential for practical application.
The RGB-D salient object detection technique has garnered significant attention in recent years due to its excellent performance. It outperforms salient object detection methods that rely solely on RGB images by leveraging the geometric morphology and spatial layout information from depth images. However, existing RGB-D detection models still encounter difficulties in accurately recognising and highlighting salient objects when facing complex scenes containing multiple or small objects. In this study, a Cross-modal Interactive and Global Awareness Fusion Network for RGB-D Salient Object Detection, named CIGNet, is proposed. Specifically, convolutional neural networks (CNNs), which are good at extracting local details, and an attention mechanism, which efficiently integrates global information, are utilized to design two fusion methods for RGB and depth images. One of these, the Cross-modal Interaction Fusion Module (CIFM), employs depthwise separable convolution and common-dimensional dynamic convolution to extract rich edge contours and texture details from low-level features. The Global Awareness Fusion Module (GAFM) is designed to relate high-level features between RGB and depth features so as to improve the model's understanding of complex scenes. In addition, prediction maps are generated through a step-by-step decoding process carried out by the Multi-layer Convolutional Fusion Module (MCFM), which gradually yields finer detection results. Finally, comparisons with 12 mainstream methods on six public benchmark datasets demonstrate superior robustness and accuracy.
DAVIS cameras, which output both event streams and frames simultaneously, are increasingly being used to address the primary object detection challenges posed by complex lighting and motion blur. Nevertheless, fully leveraging the abundant temporal information and effectively fusing data from these two modalities remains a formidable challenge. In this paper, we first design a multi-scale spatio-temporal aggregation (MSTA) module to distill richer semantic information from event frames. Secondly, we assimilate and harness the strengths of YOLOv8 and RT-DETR to develop an innovative encoder with Multi-scale Cross-modal dynamic Interactive fusion and multi-level feature interactive Fusion (MCIF). In MCIF, we propose a dynamic channel switching and spatial attention with learnable fusing factors (DCF-CSSA) to improve the complementary interaction of cross-modal features. Extensive experiments demonstrate that our approach (which we call SCNet) significantly outperforms existing state-of-the-art (SOTA) object detection methods that fuse events and frames, achieving mAP50 improvements of 6.2% on PKU-DAVIS-SOD and 12% on DESC-MOD, both of which contain a large number of samples with challenging lighting conditions and motion blur.
Accurate detection of small objects plays an important role in the application of Autonomous aerial vehicles (AAV). However, current works mainly extract comprehensive features from unimodal images, which can obtain very limited distinguishable features for objects, especially those with small sizes. To address this issue, we propose a dynamic cascade cross-modal coassisted network, which integrates multimodal images fusion and fine-grained feature learning to generate powerful object semantic representations. Specifically, we design a multimodal high-order interaction module to achieve collaborative interaction of spatial details and channel dependencies between modalities, thereby enhancing object discrimination. To preserve multimodal fine-grained details, we devise a scale-adaptive dynamic feature prompt module, which dynamically motivates the backbone network to capture feature degradation clues. Meanwhile, to maintain the spatial correlation of multimodal cross-scale features and improve the quality of feature fusion, we derive a global collaborative enhancement module into the feature pyramid network for enhancing the detection accuracy across multiple scales. Extensive experimental results on multimodal datasets have shown that our method achieves favorable performance, surpassing other state-of-the-art methods.
In autonomous driving and robotic navigation, the fusion of multimodal data from LiDAR and cameras relies on accurate extrinsic calibration. However, the calibration accuracy may drop when there is an external disturbance, such as sensor vibrations, temperature fluctuations, and aging. To address this problem, this article presents a novel LiDAR–camera joint calibration network based on cross-modal attention fusion (CMAF) and cross-domain feature extraction (CDFE). The CMAF module is constructed based on region-level matching and pixel-level interaction to improve the cross-modal feature alignment and fusion. To address the semantic inconsistency between encoder and decoder features, the CDFE is designed for a U-shaped architecture with multimodal skip connections to capture large-scale contextual correlations through the transformation from the spatial domain to the frequency domain, and it can maintain semantic consistency through the fusion of global features and original features (residual information) based on the dual-path architecture. Experiments on the KITTI odometry dataset and KITTI-360 dataset show that our network not only significantly outperforms mainstream methods and demonstrates strong generalization capabilities but also achieves high computational efficiency.
Cross-modal localization using text and point clouds enables robots to localize themselves via natural language descriptions, with applications in autonomous navigation and interaction between humans and robots. In this task, objects often recur across text and point clouds, making spatial relationships the most discriminative cues for localization. Given this characteristic, we present SpatiaLoc, a framework utilizing a coarse-to-fine strategy that emphasizes spatial relationships at both the instance and global levels. In the coarse stage, we introduce a Bezier Enhanced Object Spatial Encoder (BEOSE) that models spatial relationships at the instance level using quadratic Bezier curves. Additionally, a Frequency Aware Encoder (FAE) generates spatial representations in the frequency domain at the global level. In the fine stage, an Uncertainty Aware Gaussian Fine Localizer (UGFL) regresses 2D positions by modeling predictions as Gaussian distributions with a loss function aware of uncertainty. Extensive experiments on KITTI360Pose demonstrate that SpatiaLoc significantly outperforms existing state-of-the-art (SOTA) methods.
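BEOSE encodes instance-level spatial relations with quadratic Bezier curves; the snippet below only illustrates sampling such a curve between two object positions, with the control point placed at an offset midpoint as a stand-in for whatever SpatiaLoc actually learns or derives.

```python
import numpy as np

def quadratic_bezier(p0, p2, ctrl_offset, num=16):
    """Sample a quadratic Bezier curve B(t) = (1-t)^2 p0 + 2(1-t)t p1 + t^2 p2
    between two object positions p0 and p2. The control point p1 is placed at
    the midpoint plus an offset (fixed here for illustration only).
    Returns an array of shape (num, 2)."""
    p0, p2 = np.asarray(p0, float), np.asarray(p2, float)
    p1 = (p0 + p2) / 2 + np.asarray(ctrl_offset, float)
    t = np.linspace(0.0, 1.0, num)[:, None]
    return (1 - t) ** 2 * p0 + 2 * (1 - t) * t * p1 + t ** 2 * p2

curve = quadratic_bezier([0.0, 0.0], [4.0, 0.0], ctrl_offset=[0.0, 1.0])
print(curve.shape, curve[len(curve) // 2])  # (16, 2), apex near [2.0, 0.5]
```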
In the field of remote sensing, efficient fusion of hyperspectral images (HSI) and light detection and ranging (LiDAR) data can capture comprehensive surface features encompassing spectral and elevation information, thereby enhancing classification performance. However, existing methods face challenges such as significant inter-modal feature discrepancies, insufficient utilization of global contextual information, and a lack of multi-dimensional interaction in attention mechanisms. To address these issues, this study proposes a hybrid attention-based fusion network (HAFNet). Specifically, the self-interactive attention module (SIAM) is designed to model long-range dependencies among spectral, spatial, elevation, and cross-modal features, adaptively enhancing global feature representation and overcoming the limitations of traditional methods in global context modelling. The spatial expansion attention module (SEAM) is introduced to optimize the weight distribution of multimodal features and achieve fine-grained control of feature fusion by focusing on key regions. The cross-modal interaction (CMI) block is presented to establish deep feature correlations between the HSI and LiDAR modalities so that complementary information across modalities can be efficiently utilized. Experiments on three publicly available HSI-LiDAR benchmark datasets, including MUUFL Gulfport, Trento, and Houston, demonstrate the effectiveness and superiority of the proposed method.
Artificial emotional intelligence is a sub-domain of human–computer interaction research that aims to develop deep learning models capable of detecting and interpreting human emotional states through various modalities. A major challenge in this domain is identifying meaningful correlations between heterogeneous modalities, for example between audio and visual data, due to their distinct temporal and spatial properties. Traditional fusion techniques used in multimodal learning to combine data from different sources often fail to capture meaningful cross-modal interactions at reasonable computational cost, and they struggle to adapt to varying modality reliability. Following a review of the relevant literature, this study adopts an experimental research method to develop and evaluate a mathematical cross-modal fusion model, thereby addressing a gap in the existing research literature. The framework uses Tucker tensor decomposition to factorize the multi-dimensional data array into a core tensor and a set of factor matrices, supporting the integration of temporal features from audio and spatiotemporal features from visual modalities. A cross-attention mechanism is incorporated to enhance cross-modal interaction, enabling each modality to attend to relevant information from the other. The efficacy of the model is rigorously evaluated on three publicly available datasets, and the results conclusively demonstrate that the proposed fusion technique outperforms conventional fusion methods and several more recent approaches. The findings break new ground in this field and will be of interest to researchers and developers in artificial emotional intelligence.
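The Tucker-decomposition step described above can be sketched with the tensorly library: a (time x audio-feature x visual-feature) interaction tensor is factorized into a core tensor and per-mode factor matrices. The tensor construction and ranks below are illustrative assumptions, not the paper's setup.

```python
import numpy as np
import tensorly as tl
from tensorly.decomposition import tucker

# Illustrative interaction tensor: outer products of per-frame audio and visual
# features stacked over time, shape (T, d_audio, d_visual).
T, d_a, d_v = 20, 12, 16
audio = np.random.rand(T, d_a)
visual = np.random.rand(T, d_v)
interaction = np.einsum("ta,tv->tav", audio, visual)

# Tucker decomposition: core tensor plus one factor matrix per mode.
core, factors = tucker(tl.tensor(interaction), rank=[5, 4, 6])
print(core.shape, [f.shape for f in factors])  # (5, 4, 6), [(20, 5), (12, 4), (16, 6)]

# Reconstruction error indicates how much cross-modal structure the low-rank core retains.
recon = tl.tucker_to_tensor((core, factors))
print(np.linalg.norm(interaction - recon) / np.linalg.norm(interaction))
```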
Text-Based Person Search (TBPS) aims to retrieve target pedestrian images through language descriptions. However, the visual attributes and textual descriptions of different identities (pedestrians) tend to exhibit considerable similarity, leading to Similar Semantic Interference (SSI). To mitigate this issue, we propose the Adapting Cross-Modal Semantic Discrepancy (ACMSD) method, employing a cross-modal constraint approach to alleviate interference in model training. Specifically, we introduce the Consistent Constraint Alignment (CCA) strategy, which establishes both inter-modal alignment and intra-modal alignment, along with an Identity-Balanced Distribution (IBD) loss. This paradigm utilizes a Cyclic Image-Text Contrastive objective to regularize the spatial distribution of the modalities, while the IBD loss implicitly clusters strong positive samples by using identity as a key index. Additionally, we incorporate an Attention-based Implicit Alignment (AIA) module to enforce modality-specific embeddings, thereby strengthening the interaction between cross-modal information. Extensive experiments are conducted on three public benchmark datasets to evaluate the performance of the ACMSD method.
Recent advancements in 3D Large Language Models (LLMs) have revealed significant potential in enhancing the understanding of 3D scenes. However, previous methods have struggled to extract and utilize fine-grained information about 3D objects due to the coarseness of point clouds, resulting in limitations in understanding objects of interest (OoI) within a scene. To address this issue, we introduce an object-centric 2D-3D interaction module that enhances the ability of LLMs on 3D understanding tasks, consisting of fine-grained 2D representation perception and object-centric 3D scene representation perception. Specifically, the 2D representation associated with 3D objects is captured based on cross-modal semantic consistency without any spatial projector. Experimental results show that our model significantly outperforms existing methods on benchmarks including ScanRefer and ScanQA.
Motivation: Fine-grained cellular characterization provides critical insights into biological processes, including tissue development, disease progression, and treatment responses. The spatial organization of cells and the interactions among distinct cell types play a pivotal role in shaping the tumor micro-environment, driving heterogeneity, and influencing patient prognosis. While computational pathology can uncover morphological structures from tissue images, conventional methods are often restricted to identifying coarse-grained and limited cell types. In contrast, spatial transcriptomics-based approaches hold promise for pinpointing fine-grained transcriptional cell types using histology data. However, these methods tend to overlook key molecular signatures inherent in gene expression data. Results: To this end, we propose a cross-modal unified representation learning framework (CUCA) for identifying fine-grained cell types from histology images. CUCA is trained on paired morphology-molecule spatial transcriptomics data, enabling it to infer fine-grained cell types solely from pathology images. Our model aims to harness the cross-modal embedding alignment paradigm to harmonize the embedding spaces of morphological and molecular modalities, bridging the gap between image patterns and molecular expression signatures. Extensive results across three datasets show that CUCA captures molecule-enhanced cross-modal representations and improves the prediction of fine-grained transcriptional cell abundances. Downstream analyses of cellular spatial architectures and intercellular co-localization reveal that CUCA provides insights into tumor biology, offering potential advancements in cancer research. Availability and implementation: The source code of CUCA is available on Zenodo: 10.5281/zenodo.15087256.
Confidence in the results is a key ingredient to improve the adoption of machine learning methods by clinicians. Uncertainties on the results have been considered in the literature, but mostly those originating from the learning and processing methods. Uncertainty on the data is hardly challenged, as a single sample is often considered representative enough of each subject included in the analysis. In this paper, we propose a representation learning strategy to estimate local uncertainties on a physiological descriptor (here, myocardial deformation) previously obtained from medical images by different definitions or computations. We first use manifold alignment to match the latent representations associated to different high-dimensional input descriptors. Then, we formulate plausible distributions of latent uncertainties, and finally exploit them to reconstruct uncertainties on the input high-dimensional descriptors. We demonstrate its relevance for the quantification of myocardial deformation (strain) from 3D echocardiographic image sequences of the right ventricle, for which a lack of consensus exists in its definition and which directional component to use. We used a database of 100 control subjects with right ventricle overload, for which different types of strain are available at each point of the right ventricle endocardial surface mesh. Our approach quantifies local uncertainties on myocardial deformation from different descriptors defining this physiological concept. Such uncertainties cannot be directly estimated by local statistics on such descriptors, potentially of heterogeneous types. Beyond this controlled illustrative application, our methodology has the potential to be generalized to many other population analyses considering heterogeneous high-dimensional descriptors.
PURPOSE: Alzheimer's disease (AD) is a neurodegenerative disorder characterized by progressive cognitive decline. We proposed a novel latent multimodal deep learning framework to predict AD cognitive status using clinical, neuroimaging, and genetic data. METHODS: Three hundred and twenty-two patients aged between 55 and 92 from the ADNI database were included in the study. Confirmatory Factor Analysis (CFA) was applied to derive the latent scores of AD cognitive impairment as the outcome. A multimodal deep neural network with three modalities, including clinical data, imaging data, and genetic data, was constructed. Attention layers and cross-attention layers were added to improve prediction; modality importance scores were calculated for interpretation. Mean Absolute Error (MAE) and Mean Squared Error (MSE) were used to evaluate model performance. RESULTS: The CFA demonstrated good fit to the data. The multimodal neural network of clinical and imaging modalities with attention layers was the best predictive model, with an MAE of 0.330 and an MSE of 0.206. Clinical data contributed the most (35%) to the prediction of AD cognitive status. CONCLUSION: Our results demonstrated the attention multimodal model's superior performance in predicting the cognitive impairment of AD; introducing attention layers into the model enhanced the prediction performance.
The bifurcation of generative modeling into autoregressive approaches for discrete data (text) and diffusion approaches for continuous data (images) hinders the development of truly unified multimodal systems. While Masked Language Models (MLMs) offer efficient bidirectional context, they traditionally lack the generative fidelity of autoregressive models and the semantic continuity of diffusion models. Furthermore, extending masked generation to multimodal settings introduces severe alignment challenges and training instability. In this work, we propose CoM-DAD (Coupled Manifold Discrete Absorbing Diffusion), a novel probabilistic framework that reformulates multimodal generation as a hierarchical dual-process. CoM-DAD decouples high-level semantic planning from low-level token synthesis. First, we model the semantic manifold via a continuous latent diffusion process; second, we treat token generation as a discrete absorbing diffusion process, regulated by a Variable-Rate Noise Schedule, conditioned on these evolving semantic priors. Crucially, we introduce a Stochastic Mixed-Modal Transport strategy that aligns disparate modalities without requiring heavy contrastive dual-encoders. Our method demonstrates superior stability over standard masked modeling, establishing a new paradigm for scalable, unified text-image generation.
With the integration of image modality, the semantic space of multimodal large language models (MLLMs) is more complex than text-only models, making their interpretability more challenging and their alignment less stable, particularly susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering methods could achieve more than 110% performance with less than 50% data. Our results highlight SAE-V's ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.
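SAE-V builds on the standard sparse-autoencoder recipe applied to hidden activations; the PyTorch sketch below shows that generic building block (an overcomplete ReLU dictionary trained with reconstruction plus L1 sparsity), not SAE-V's cross-modal feature weighting or data-filtering machinery.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal SAE over residual-stream activations: an overcomplete ReLU
    dictionary with an L1 sparsity penalty. This is the generic building block
    that SAE-V extends to multimodal activations, not the full framework."""
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, h):
        z = F.relu(self.encoder(h))          # sparse feature activations
        h_hat = self.decoder(z)
        return h_hat, z

sae = SparseAutoencoder(d_model=512, d_dict=4096)
h = torch.randn(32, 512)                      # e.g. hidden states from an MLLM layer
h_hat, z = sae(h)
loss = F.mse_loss(h_hat, h) + 1e-3 * z.abs().mean()   # reconstruction + sparsity
print(loss.item())
```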
Although many previous studies have carried out multimodal learning with real-time MRI data that captures the audio-visual kinematics of the vocal tract during speech, these studies have been limited by their reliance on multi-speaker corpora. This prevents such models from learning a detailed relationship between acoustics and articulation due to considerable cross-speaker variability. In this study, we develop unimodal audio and video models as well as multimodal models for phoneme recognition using a long-form single-speaker MRI corpus, with the goal of disentangling and interpreting the contributions of each modality. Audio and multimodal models show similar performance on different phonetic manner classes but diverge on places of articulation. Interpretation of the models' latent space shows similar encoding of the phonetic space across audio and multimodal models, while the models' attention weights highlight differences in acoustic and articulatory timing for certain phonemes.
Multimodal sentiment analysis necessitates the seamless integration of textual and visual signals for the precise interpretation of user-generated material. In this paper, we introduce Dimension-Wise Gated Cross-Attention (DGCA). This new fusion mechanism fine-tunes the interaction between language and images more precisely than prior methods. Our method uses a bidirectional cross-attention module to iteratively enhance text and image features. We use a dimension-wise gating technique in which each latent dimension independently learns to weigh contributions from text or image signals using softmax-normalized modality gates. The approach uses selective per-dimension fusion to highlight important cues from one modality while minimizing less useful characteristics from another. On the SemEval-2020 Memotion dataset, DGCA outperformed the state-of-the-art (SOTA) baselines by 2.27%, highlighting its ability to detect subtle affective cues. In summary, DGCA improves performance and interpretability, enabling fine-grained and context-aware multimodal sentiment analysis.
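The dimension-wise gating described above is simple enough to sketch directly: for each latent dimension, softmax-normalized gates decide how much of the fused value comes from the text versus the image stream. The module below is an illustrative reading of that mechanism, assuming the cross-attended features already share a common dimensionality.

```python
import torch
import torch.nn as nn

class DimensionWiseGate(nn.Module):
    """Per-dimension softmax-normalized modality gates: each of the d latent
    dimensions independently weighs text vs. image contributions. The inputs
    are assumed to be the already cross-attended features; this is only the
    gating core, not the full bidirectional cross-attention module."""
    def __init__(self, d):
        super().__init__()
        self.gate_logits = nn.Parameter(torch.zeros(2, d))  # [text; image] logits per dimension

    def forward(self, text_feat, image_feat):
        gates = torch.softmax(self.gate_logits, dim=0)       # (2, d), sums to 1 per dimension
        return gates[0] * text_feat + gates[1] * image_feat

fuse = DimensionWiseGate(d=256)
print(fuse(torch.randn(4, 256), torch.randn(4, 256)).shape)  # torch.Size([4, 256])
```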
Cardiac disease diagnosis demands precise interpretation of complex physiological signals; however, existing systems often rely on unimodal data and lack adaptive fusion strategies. Most conventional frameworks fall short in capturing intermodal dependencies and adjusting to performance variations across heterogeneous inputs. This study introduces StackTrans, a transformer-based multimodal classification framework designed to improve diagnostic accuracy through ECG and PCG signal integration. The architecture comprises modality-specific transformer encoders, a bidirectional cross-modal fusion transformer that facilitates latent-level attention between modalities, and a stacked ensemble mechanism governed by a meta-learner. Residual learning modules enhance prediction refinement, while entropy-guided adaptive voting improves confidence-weighted decision reliability. The PCG and ECG modules are independently trained using the PhysioNet 2016 and MIT-BIH Arrhythmia datasets, respectively, and integrated through joint inference. Evaluations using TensorFlow on an NVIDIA RTX GPU demonstrate that StackTrans attains a precision of 98.6%, an F1 score of 98.4%, and an AUC of 0.99, outperforming unimodal ECG and PCG models by 2.5% and 7.5%, respectively.
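One plausible reading of the entropy-guided adaptive voting mentioned above is sketched below: each branch's class probabilities are weighted by one minus their normalized entropy, so confident branches dominate the fused decision (an assumption about the rule, not the exact StackTrans implementation).

```python
import numpy as np

def entropy_weighted_vote(prob_list, eps=1e-12):
    """Combine per-branch class-probability vectors by weighting each branch
    with (1 - normalized entropy): confident, low-entropy branches dominate.
    prob_list: list of arrays of shape (num_classes,)."""
    probs = np.stack(prob_list)                                   # (M, C)
    ent = -(probs * np.log(probs + eps)).sum(axis=1)              # (M,) branch entropies
    ent_norm = ent / np.log(probs.shape[1])                       # normalized to [0, 1]
    weights = (1.0 - ent_norm) + eps
    weights /= weights.sum()
    fused = (weights[:, None] * probs).sum(axis=0)
    return fused / fused.sum()

ecg = np.array([0.9, 0.1])      # confident ECG branch
pcg = np.array([0.55, 0.45])    # uncertain PCG branch
print(entropy_weighted_vote([ecg, pcg]))  # pulled toward the ECG prediction
```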
We present a mutually aligned diffusion framework for cross-modal biomechanical motion generation, guided by a dynamical systems perspective. By treating each modality, e.g., observed joint angles ($X$) and ground reaction forces ($Y$), as complementary observations of a shared underlying locomotor dynamical system, our method aligns latent representations at each diffusion step, so that one modality can help denoise and disambiguate the other. Our alignment approach is motivated by the fact that local time windows of $X$ and $Y$ represent the same phase of an underlying dynamical system, thereby benefiting from a shared latent manifold. We introduce a simple local latent manifold alignment (LLMA) strategy that incorporates first-order and second-order alignment within the latent space for robust cross-modal biomechanical generation without bells and whistles. Through experiments on multimodal human biomechanics data, we show that aligning local latent dynamics across modalities improves generation fidelity and yields better representations.
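One plausible reading of "first-order and second-order alignment" is matching the means and covariances of paired local-window latents from the two modalities. The toy loss below follows that assumption and is not the paper's exact objective.

```python
# Assumed first-/second-order latent alignment loss for paired local windows of X and Y.
import torch

def llma_alignment_loss(z_x, z_y):
    # z_x, z_y: (batch, d) latents of time-aligned windows of joint angles (X) and forces (Y)
    first_order = ((z_x.mean(dim=0) - z_y.mean(dim=0)) ** 2).sum()        # match latent means
    zx_c = z_x - z_x.mean(dim=0, keepdim=True)
    zy_c = z_y - z_y.mean(dim=0, keepdim=True)
    cov_x = zx_c.T @ zx_c / (z_x.shape[0] - 1)
    cov_y = zy_c.T @ zy_c / (z_y.shape[0] - 1)
    second_order = ((cov_x - cov_y) ** 2).sum()                           # match latent covariances
    return first_order + second_order

loss = llma_alignment_loss(torch.randn(32, 16), torch.randn(32, 16))
```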
Extracellular electrophysiological recordings present unique computational challenges for neuronal classification due to noise, technical variability, and batch effects across experimental systems. We introduce HIPPIE (High-dimensional Interpretation of Physiological Patterns In Extracellular recordings), a deep learning framework that combines self-supervised pretraining on unlabeled datasets with supervised fine-tuning to classify neurons from extracellular recordings. Using conditional convolutional joint autoencoders, HIPPIE learns robust, technology-adjusted representations of waveforms and spiking dynamics. This model can be applied to electrophysiological classification and clustering across diverse biological cultures and technologies. We validated HIPPIE on both in vivo mouse recordings and in vitro brain slices, where it demonstrated superior performance over other unsupervised methods in cell-type discrimination and aligned closely with anatomically defined classes. Its latent space organizes neurons along electrophysiological gradients, while enabling batch and individual corrected alignment of recordings across experiments. HIPPIE establishes a general framework for systematically decoding neuronal diversity in native and engineered systems.
Many radar applications rely primarily on visual classification for their evaluations. However, new research is integrating textual descriptions alongside visual input and showing that such multimodal fusion improves contextual understanding. A critical issue in this area is the effective alignment of coded text with corresponding images. To this end, our paper presents an adversarial training framework that generates descriptive text from the latent space of a visual radar classifier. Our quantitative evaluations show that this dual-task approach maintains a robust classification accuracy of 98.3% despite the inclusion of Gaussian-distributed latent spaces. Beyond these numerical validations, we conduct a qualitative study of the text output in relation to the classifier’s predictions. This analysis highlights the correlation between the generated descriptions and the assigned categories and provides insight into the classifier’s visual interpretation processes, particularly in the context of normally uninterpretable radar data.
Advanced autoregressive models have reshaped multimodal AI. However, their transformative potential in medical imaging remains largely untapped due to the absence of a unified visual tokenizer -- one capable of capturing fine-grained visual structures for faithful image reconstruction and realistic image synthesis, as well as rich semantics for accurate diagnosis and image interpretation. To this end, we present MedITok, the first unified tokenizer tailored for medical images, encoding both low-level structural details and high-level clinical semantics within a unified latent space. To balance these competing objectives, we introduce a novel two-stage training framework: a visual representation alignment stage that cold-starts the tokenizer reconstruction learning with a visual semantic constraint, followed by a textual semantic representation alignment stage that infuses detailed clinical semantics into the latent space. Trained on the meticulously collected large-scale dataset with over 30 million medical images and 2 million image-caption pairs, MedITok achieves state-of-the-art performance on more than 30 datasets across 9 imaging modalities and 4 different tasks. By providing a unified token space for autoregressive modeling, MedITok supports a wide range of tasks in clinical diagnostics and generative healthcare applications. Model and code will be made publicly available at: https://github.com/Masaaki-75/meditok.
Generative AI confronts semiotics with a new kind of sign-producing machine that actively reshapes the production and interpretation of visual content. Addressing the lack of humanities-based transdisciplinary research on this transformation, this study aims to establish a methodological foundation for the semiotic analysis of multimodal AI. By combining visual, social, quantitative, and multimodal semiotics, the paper proposes an integrated micro–meso–macro framework for evaluating AI-generated images. The analysis moves from the micro-level examination of plastic features and text-to-image translation, through the meso-level of enunciation, narrativity, and causality, to the macro-level of social stereotypes, ideology, creativity, rhetoric, truth, and inference. This is supported by a case study on lonely death and a semiotic explanation of latent space.
Classifier-free guidance (CFG) is a widely used technique for controllable generation in diffusion and flow-based models. Despite its empirical success, CFG relies on a heuristic linear extrapolation that is often sensitive to the guidance scale. In this work, we provide a principled interpretation of CFG through the lens of optimization. We demonstrate that the velocity field in flow matching corresponds to the gradient of a sequence of smoothed distance functions, which guides latent variables toward the scaled target image set. This perspective reveals that the standard CFG formulation is an approximation of this gradient, where the prediction gap, the discrepancy between conditional and unconditional outputs, governs guidance sensitivity. Leveraging this insight, we reformulate the CFG sampling as a homotopy optimization with a manifold constraint. This formulation necessitates a manifold projection step, which we implement via an incremental gradient descent scheme during sampling. To improve computational efficiency and stability, we further enhance this iterative process with Anderson Acceleration without requiring additional model evaluations. Our proposed methods are training-free and consistently refine generation fidelity, prompt alignment, and robustness to the guidance scale. We validate their effectiveness across diverse benchmarks, demonstrating significant improvements on large-scale models such as DiT-XL-2-256, Flux, and Stable Diffusion 3.5.
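The heuristic linear extrapolation that the paper reinterprets is the standard CFG rule applied to the predicted velocity. A minimal sketch, with a stand-in velocity predictor since the real model is not shown here:

```python
# Standard CFG extrapolation on a flow-matching velocity field; `model` is hypothetical.
import torch

def cfg_velocity(model, x_t, t, cond, guidance_scale=5.0):
    v_uncond = model(x_t, t, None)    # unconditional prediction
    v_cond = model(x_t, t, cond)      # conditional prediction
    # Guidance amplifies the "prediction gap" v_cond - v_uncond, which governs sensitivity.
    return v_uncond + guidance_scale * (v_cond - v_uncond)

# Toy check with a stand-in predictor (the paper's manifold-projection step is not shown).
dummy_model = lambda x, t, cond: x * 0.1 + (0.0 if cond is None else 0.05)
v = cfg_velocity(dummy_model, torch.randn(2, 4), t=0.3, cond=1)
```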
Because optical and SAR sensors image at different wavelengths, optical images usually embed low-dimensional manifolds of higher dimension into the ambient space than SAR images do. How to utilize their complementarity remains challenging for multimodal clustering. In this study, we devise a conditional dual diffusion (CDD) model for multimodal clustering of optical and SAR images, and theoretically prove that it is equivalent to a probability flow ordinary differential equation (ODE) having a unique solution. Different from vanilla diffusion models, the CDD model is equipped with a decoupling autoencoder to predict noise and clean images simultaneously, preserving the data manifolds embedded in latent space. To fuse the manifolds of optical and SAR images, we train the model to generate optical images conditioned on SAR images, mapping them into a unified latent space. The learned features extracted from the model are fed to the K-means algorithm to produce the resulting clusters. To the best of our knowledge, this study could be one of the first diffusion models for multimodal clustering. Extensive comparison experiments on three large-scale optical-SAR pair datasets show the superiority of our method over state-of-the-art (SOTA) methods overall in terms of clustering performance and time consumption. The source code is available at https://github.com/suldier/CDD.
Understanding how Large Language Models (LLMs) perform complex reasoning and their failure mechanisms is a challenge in interpretability research. To provide a measurable geometric analysis perspective, we define the concept of the Reasoning Manifold, a latent low-dimensional geometric structure formed by the internal representations corresponding to all correctly reasoned generations. This structure can be conceptualized as the embodiment of the effective thinking paths that the model has learned to successfully solve a given task. Based on this concept, we build REMA, a framework that explains the origins of failures by quantitatively comparing the spatial relationships of internal model representations corresponding to both erroneous and correct reasoning samples. Specifically, REMA first quantifies the geometric deviation of each erroneous representation by calculating its k-nearest neighbors distance to the approximated manifold formed by correct representations, thereby providing a unified failure signal. It then localizes the divergence points where these deviations first become significant by tracking this deviation metric across the model's layers and comparing it against a baseline of internal fluctuations from correct representations, thus identifying where the reasoning chain begins to go off-track. Our extensive experiments on diverse language and multimodal models and tasks demonstrate the low-dimensional nature of the reasoning manifold and the high separability between erroneous and correct reasoning representations. The results also validate the effectiveness of the REMA framework in analyzing the origins of reasoning failures. This research connects abstract reasoning failures to measurable geometric deviations in representations, providing new avenues for in-depth understanding and diagnosis of the internal computational processes of black-box models.
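The deviation signal REMA describes (the mean k-nearest-neighbor distance from an erroneous representation to the set of correct-reasoning representations, computed layer by layer) can be sketched as follows; array shapes and k are illustrative assumptions.

```python
# Per-layer kNN deviation of erroneous hidden states from the "reasoning manifold"
# approximated by correct-reasoning hidden states.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_deviation(correct_reps, error_reps, k=10):
    # correct_reps: (n_correct, d), error_reps: (n_error, d), both from one layer
    nn = NearestNeighbors(n_neighbors=k).fit(correct_reps)
    dists, _ = nn.kneighbors(error_reps)          # (n_error, k) distances to nearest correct reps
    return dists.mean(axis=1)                     # one deviation score per erroneous sample

# Toy: track the deviation metric across 12 layers to locate where reasoning goes off-track.
layer_scores = [
    knn_deviation(np.random.randn(500, 64), np.random.randn(50, 64))
    for _ in range(12)
]
```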
Eye-tracking data reveals valuable insights into users’ cognitive states but is difficult to analyze due to its structured, non-linguistic nature. While large language models (LLMs) excel at reasoning over text, they struggle with temporal and numerical data. This paper presents a multimodal human–AI collaborative framework designed to enhance cognitive pattern extraction from eye-tracking signals. The framework includes: (1) a multi-stage pipeline using horizontal and vertical segmentation alongside LLM reasoning to uncover latent gaze patterns; (2) an Expert–Model Co-Scoring Module that integrates expert judgment with LLM output to generate trust scores for behavioral interpretations; and (3) a hybrid anomaly detection module combining LSTM-based temporal modeling with LLM-driven semantic analysis. Our results across several LLMs and prompt strategies show improvements in consistency, interpretability, and performance, with up to 50% accuracy in difficulty prediction tasks. This approach offers a scalable, interpretable solution for cognitive modeling and has broad potential in adaptive learning, human–computer interaction, and educational analytics.
Most models of generative AI for images assume that images are inherently low-dimensional objects embedded within a high-dimensional space. Additionally, it is often implicitly assumed that thematic image datasets form smooth or piecewise smooth manifolds. Common approaches overlook the geometric structure and focus solely on probabilistic methods, approximating the probability distribution through universal approximation techniques such as the kernel method. In some generative models, the low-dimensional nature of the data manifests itself through the introduction of a lower-dimensional latent space. Yet the probability distribution in the latent or the manifold's coordinate space is considered uninteresting and is predefined or assumed uniform. In this study, we address the problem of Blind Image Denoising (BID), and to some extent the problem of generating images from noise, by unifying geometric and probabilistic perspectives. We introduce a novel framework that improves upon existing probabilistic approaches by incorporating geometric assumptions that enable the effective use of kernel-based probabilistic methods. Furthermore, the proposed framework extends prior geometric approaches by combining explicit and implicit manifold descriptions through the introduction of a distance function. The resulting framework demystifies diffusion models by interpreting them as a projection mechanism onto the manifold of "good images". This interpretation leads to the construction of a new deterministic model, the Manifold-Probabilistic Projection Model (MPPM), which operates in both the representation (pixel) space and the latent space. We demonstrate that the Latent MPPM (LMPPM) outperforms the Latent Diffusion Model (LDM) across various datasets, achieving superior results in terms of image restoration and generation.
Understanding low-dimensional structures within high-dimensional data is crucial for visualization, interpretation, and denoising in complex datasets. Despite advances in manifold learning techniques, key challenges such as limited global insight and the lack of interpretable analytical descriptions remain unresolved. In this work, we introduce a novel framework, GAMLA (Global Analytical Manifold Learning using Auto-encoding). GAMLA employs a two-round training process within an auto-encoding framework to derive both character and complementary representations of the underlying manifold. With the character representation, the manifold is represented by a parametric function that unfolds the manifold to provide a global coordinate system. With the complementary representation, an approximate explicit manifold description is developed, offering a global and analytical representation of the smooth manifolds underlying high-dimensional datasets. This enables the analytical derivation of geometric properties such as curvature and normal vectors. Moreover, we find that the two representations together decompose the whole latent space and can thus characterize the local spatial structure surrounding the manifold, proving particularly effective in anomaly detection and categorization. Through extensive experiments on benchmark datasets and real-world applications, GAMLA demonstrates its ability to achieve computational efficiency and interpretability while providing precise geometric and structural insights. This framework bridges the gap between data-driven manifold learning and analytical geometry, presenting a versatile tool for exploring the intrinsic properties of complex datasets.
Latent space representations learned through variational autoencoders (VAEs) offer a powerful, unsupervised means of capturing nonlinear structure in high-dimensional oncology data. The latent embedding spaces often encode information that differs from traditional bioinformatics methods such as t-SNE or UMAP. However, a persistent challenge remains: how to meaningfully visualize and interpret these latent variables. Common dimensionality reduction techniques like UMAP and t-SNE, while effective, can obscure graph-theoretic relationships that may underlie important biological patterns. We present a novel approach for intuitive latent space interpretation using NetFlow, a method that visualizes the organizational structure of samples as a graph derived from their latent embeddings. NetFlow constructs a topological representation based on the metric structure of the latent space, drawing on concepts from network analysis, optimal mass transport, topological data analysis, and lineage tracing. The result is an interpretable graph in which nodes represent individual subjects and edges reflect local and global similarity among the samples. We applied this method to multiple myeloma (MM), a hematologic malignancy marked by malignant plasma cell proliferation and inevitable relapse. To uncover hidden disease subtypes, we trained a VAE on multimodal data from 659 patients in the MMRF CoMMpass dataset (IA19), integrating transcriptomic, genomic, and clinical features. Direct clustering of latent space vectors failed to yield subgroups with significant differences in progression-free survival (PFS). In contrast, NetFlow generated a latent space graph that, when clustered using Louvain community detection, identified three distinct subtypes: one high-risk and two low-risk groups. The high-risk group exhibited a median PFS 1.5 years shorter than that of the low-risk groups (p<0.001) and was enriched for known poor prognostic markers including gain 1q21 (59%), MAF translocations (17%), and t(4;14) (66%). Although the two low-risk groups had similar PFS outcomes, they differed in their molecular profiles, suggesting they may benefit from different therapeutic strategies. These preliminary results demonstrate that variational autoencoders and NetFlow graph analysis can reveal latent substructures missed by traditional clustering, thereby advancing latent space explainability and enabling improved subtype discovery in MM. Our framework offers a generalizable pipeline for interpreting deep generative models in cancer genomics.
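A heavily simplified stand-in for the graph-construction and clustering step (a k-nearest-neighbor graph over VAE latents plus Louvain communities) is sketched below; it omits the optimal-transport, TDA, and lineage-tracing machinery the abstract mentions, and all sizes are placeholders.

```python
# kNN graph over VAE latent embeddings + Louvain communities (requires networkx >= 3.0).
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities
from sklearn.neighbors import kneighbors_graph

def latent_graph_clusters(z, k=15, seed=0):
    # z: (n_patients, latent_dim) embeddings from a trained VAE
    adj = kneighbors_graph(z, n_neighbors=k, mode="connectivity")   # sparse kNN adjacency
    g = nx.from_scipy_sparse_array(adj)                             # undirected patient graph
    communities = louvain_communities(g, seed=seed)
    labels = np.empty(z.shape[0], dtype=int)
    for c_id, members in enumerate(communities):
        labels[list(members)] = c_id
    return g, labels

graph, labels = latent_graph_clusters(np.random.randn(659, 32))     # toy latents, 659 patients
```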
The advancement of remote sensing technology has led to a progressive enhancement in the resolution of remote sensing data, offering a multiperspective approach to Earth observation and facilitating a more comprehensive scene interpretation. As the two most commonly utilized data sources in remote sensing, optical images and synthetic aperture radar (SAR) data can provide complementary information, effectively compensating for the limitations inherent to a single modality. However, existing methods for using these two data sources face the following issues: first, insufficient utilization of the complete information provided by the source data; second, inadequate consideration of the distinct characteristics of different modalities during feature extraction; and third, ignoring the misalignment between heterogeneous data, leading to large information loss. To tackle these challenges, we first construct a benchmark dataset comprising complex-valued SAR data and optical images, named Multi-Complex-Seg. In order to fully mine the complete and valid information provided by both data sources, we construct a multimodal segmentation framework built on the principle of "subdomain extraction and cross-domain fusion," in which we design a feature extractor better suited to complex-valued SAR data that fully considers its unique geometric properties. In addition, a dynamic feature alignment module (DFAM) is proposed to further adjust the cross-modal features, and a cross-modal heterogeneous feature fusion module (CHFFM) first maps features into the same latent space to obtain better fused features. Together, DFAM and CHFFM reduce the large semantic gap between modalities, thus facilitating the extraction of intramodal specificity and cross-modal complementarity. Extensive experiments on the proposed Multi-Complex-Seg confirm the effectiveness of our framework in comparison to other state-of-the-art multimodal segmentation approaches.
Multimodal human–AI systems generally consider facial expressions and body motions as separate input streams, leading to disjointed interpretations and diminished emotional coherence. To overcome this issue, we offer the Engagement-Safe Expressive Alignment (ESEA) paradigm and the Unified Visual Synchrony (UVS) framework as its computational implementation. UVS models the coherence between facial expressions and gestures, offering an interpretable visual synchrony signal that can function as adaptive feedback in human–AI interactions. The framework’s key component is the Consistency Index for Affective Synchrony (CIAS), which correlates brief visual segments with scalar synchrony scores through a common latent representation. Facial and gestural signals are processed by modality-specific projection networks into a unified latent space, and CIAS is derived from the similarity and short-term temporal consistency of these latent trajectories. The synchrony index is regarded as an estimation of affective visual coherence within the ESEA paradigm. We formalize the UVS/CIAS framework and conduct a comparative experimental evaluation utilizing matched and mismatched face–gesture segments derived from rendered dialog footage. Utilizing ROC analysis, score distribution comparisons, temporal visualizations, and negative control tests, we illustrate that CIAS effectively captures structured face–gesture alignment that surpasses similarity-based baselines, while also delivering a persistent, time-resolved synchronization signal. These findings establish CIAS as a principled and interpretable feedback signal for future affect-aware, engagement-focused multimodal agents.
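One way a CIAS-like score could be instantiated is per-frame cosine similarity between the projected face and gesture latents plus a short-term temporal-consistency term; the weighting and window handling below are assumptions, not the paper's exact definition.

```python
# Assumed CIAS-style synchrony score over projected face/gesture latent trajectories.
import torch
import torch.nn.functional as F

def cias_score(z_face, z_gesture, alpha=0.5):
    # z_face, z_gesture: (T, d) latent trajectories for one short video segment
    sim = F.cosine_similarity(z_face, z_gesture, dim=-1)           # (T,) per-frame alignment
    consistency = 1.0 - (sim[1:] - sim[:-1]).abs().mean()          # penalize jittery alignment
    return alpha * sim.mean() + (1 - alpha) * consistency          # scalar synchrony index

score = cias_score(torch.randn(48, 128), torch.randn(48, 128))
```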
Predicting spatial gene expression from histological images is a fundamental task in understanding tissue organization and molecular phenotypes. However, existing methods often rely on single-modality representations or lack effective alignment between image and transcriptomic features. To address these limitations, we propose MViTGene, a unified multimodal learning framework that integrates histological imaging and spatial transcriptomics through a shared latent representation space. Specifically, histological H&E images are encoded by a ResNet50-based convolutional stem and a MobileViT Transformer backbone to extract hierarchical visual representations. Both modalities are projected into a shared latent space via linear-GELU-dropout transformation blocks, enabling cross-modal alignment through a contrastive learning objective that maximizes agreement between corresponding image and spot embeddings. Experimental results on the 10x Genomics Visium dataset of human liver tissue demonstrate that MViTGene achieves significantly higher prediction accuracy than existing methods across multiple gene subsets, with improvements of 20%, 33%, and 12% in predicting marker genes, highly expressed genes, and highly variable genes, respectively. The significant improvement in relevance indicates that the model can more accurately capture the true correspondence between tissue morphology and gene expression, therefore enabling more reliable biological interpretation. It provides a computational tool for high-throughput spatial gene expression prediction that balances performance and interpretability.
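The contrastive alignment objective described here is essentially a symmetric CLIP-style loss between image and spot embeddings in the shared latent space. A minimal sketch, with temperature and dimensions as assumptions:

```python
# Symmetric contrastive (InfoNCE) alignment between image and spot embeddings.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, spot_emb, temperature=0.07):
    # img_emb, spot_emb: (batch, d); row i of each corresponds to the same tissue spot
    img = F.normalize(img_emb, dim=-1)
    spot = F.normalize(spot_emb, dim=-1)
    logits = img @ spot.T / temperature                      # pairwise similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)  # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```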
To address the problems of insufficient medical image feature extraction and the competing demands of classification accuracy and computational complexity in automatic skin lesion diagnosis in edge computing environments, this paper proposes a real-time, pseudo-multimodal, low-delay diagnosis framework, SCGViT, based on a vision transformer. The framework is constructed around three functional objectives: mitigating data imbalance through generative modeling, capturing diverse representations via multi-dimensional perception, and optimizing feature fusion through adaptive refinement. First, Class-Conditioned Generative Adversarial Networks (CGANs) are used to simulate the manifolds of minority-class samples in latent space, achieving a preliminary balance of the data distribution. Second, a branch feature-extraction path is constructed that simulates inversion (INV) and infrared (IR) modes from the original RGB images, in order to achieve multi-dimensional perception. Finally, a cross-attention mechanism is used for cross-branch feature aggregation, and a channel-attention mechanism (squeeze-and-excitation) is embedded for secondary refinement of the mixed global and local features, enhancing the representation of key pathological regions by integrating complementary structural and contrast information. Experimental results on the HAM10000 dataset show that the F1 score reached 0.973, the inference speed reached 304.439 FPS, the parameter count was only 0.524 M, and the computational complexity was only 0.866 G FLOPs, achieving a balance between high accuracy and light weight.
Pattern electroretinogram (PERG) is the standard for assessing retinal ganglion cell function. However, the low amplitude and complex waveform of PERG signals complicate clinical interpretation. This study proposes a robust, multimodal hybrid machine learning framework that detects retinal dysfunction under a rigorous patient-level validation strategy by integrating PERG waveform features with clinical demographic data. The PERG-IOBA dataset, consisting of 1354 signals from 304 participants, was used. Training and test sets were separated at the patient level using 5-fold cross-validation to approximate real clinical deployment and to avoid information leakage. A dual-stream model was developed: one stream processed functional PERG features (latency, amplitude, and RMS) via a multilayer perceptron, while the second stream processed clinical data. The two representations were then fused at the feature concatenation level. This model (Model 1) was compared with a stacking ensemble of conventional classifiers (Model 2) and a two-stage cascade classifier tailored for screening (Model 3). Model 2 achieved the most balanced and robust performance, with 71.4% accuracy and an area under the curve of 0.76 in 5-fold patient-level cross-validation. Although more modest than many previously reported values, these metrics are consistent with realistic clinical generalizability. Model 3 provided the highest sensitivity, 79.7%, for screening purposes. SHAP analysis confirmed P50-N95 amplitude as the primary biomarker but identified age as a significant confounding factor, mimicking expert clinical judgment. This study demonstrates that retinal dysfunction detection requires a holistic approach that integrates signal morphology and patient demographics.
This paper presents a retrieval-augmented, CLIP-based pipeline to support educators and psychologists in interpreting children’s drawings. The proposed system integrates vision-language captioning, case-based retrieval, and large language model generation, where CLIP produces descriptive captions, FAISS retrieves the top-$k$ ($k=3$) semantically similar expert-annotated cases, and GPT-3.5 synthesizes a psychologically informed narrative report. Evaluation on 1,222 drawings with expert reference analyses considers six controlled variants (A1–A6) across two axes: (i) semantic similarity to expert interpretations (cosine similarity, BERTScore, ROUGE-L F1) and (ii) coverage and robustness (content-unit precision/recall/F1, ROUGE-L recall, new-information rate, and negation mismatch). In addition to the CLIP/BLIP ablations (A1–A4), two image-only multimodal VLM baselines are benchmarked (GPT-4o and LLaVA; A5–A6), and a small blinded expert rating study is conducted on 20 randomly selected drawings to contextualize trade-offs. Results show that retrieval-grounded configurations improve coverage compared to non-retrieval baselines and, in the CLIP-based comparison (A3 vs. A4), reduce polarity (negation) mismatches while maintaining competitive semantic similarity. Although GPT-4o attains the strongest similarity and content-unit scores overall, it incurs substantially higher latency and does not provide retrieval-based evidence trails; LLaVA shows mixed gains with notably lower content recall. The proposed CLIP-based variant demonstrates a favorable trade-off between faithfulness and efficiency for practical deployment, and is explicitly positioned as decision support rather than diagnosis, surfacing salient cues and transparent evidence trails for expert oversight. Overall, this work demonstrates that retrieval grounding can make AI-assisted interpretation of children’s drawings more consistent, transparent, and scalable in educational and clinical contexts.
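The FAISS retrieval step (top-3 expert-annotated cases for a new drawing's embedding) can be sketched as follows; the embedding dimension, cosine-similarity setup, and variable names are assumptions for illustration, not the paper's released code.

```python
# Case retrieval sketch: index expert-annotated case embeddings and fetch the top-3 neighbors.
import numpy as np
import faiss

d = 512                                              # assumed CLIP embedding size
case_embeddings = np.random.rand(1222, d).astype("float32")   # one embedding per annotated case
faiss.normalize_L2(case_embeddings)                  # cosine similarity via inner product
index = faiss.IndexFlatIP(d)
index.add(case_embeddings)

query = np.random.rand(1, d).astype("float32")       # embedding of a new drawing
faiss.normalize_L2(query)
scores, case_ids = index.search(query, 3)            # top-3 most similar expert cases
```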
Dyslexia is a neurocognitive disorder in which problems with reading fluency, spelling, and comprehension persist throughout the lifespan and often go undiagnosed until severe academic problems emerge. Traditional diagnostic systems rely heavily on behavioural assessment, which is prone to examiner bias, language barriers, and cultural diversity, limiting the cost-effectiveness and speed of diagnosis. To overcome these limitations, this paper develops a cross-modal attention-based fusion system: a multimodal screening model that integrates brain-derived neural indices (NIs) and behavioural measures (BMs). Neurophysiological recordings (electroencephalographic and functional near-infrared spectroscopy responses during reading and phonological processing) capture cortical activation, while behavioural indicators (e.g., eye-movement patterns, handwriting movement patterns, and common literacy tests) provide additional cues to cognitive and motor activity. Each data stream is preprocessed, discriminative features are extracted, and the features are transformed into latent embeddings. To facilitate robust fusion and lessen the burden of incomplete inputs or heterogeneous signal quality, an attention-based selective alignment module selectively targets modality-specific information. A predictive classifier then uses this joint representation to produce calibrated probabilistic screening outputs and risk evaluations. An interpretability layer identifies influential attributes, emphasizing the transparency required for critical clinical assessment. On a heterogeneous cohort, the method is more accurate and sensitive than behaviour-only baselines and performs consistently across demographic and linguistic subgroups. The proposed design offers a translational link between neural correlates and dyslexia screening, aiding the interpretation of functional brain measures and thereby supporting earlier, more equitable, and more reliable dyslexia screening and evidence-based intervention planning.
Single-cell omics technologies now enable simultaneous profiling of multiple genomic modalities within individual cells. Integration of these multi-omics data necessitates computational frameworks that establish cross-modal associations while preserving biological fidelity. A central challenge lies in balancing two competing objectives: alignment of heterogeneous omics layers and retention of modality-specific distributions. Excessive alignment may lead to semantic loss, while strict distribution preservation may trigger modal separation in the latent representations. This fundamental trade-off underscores the need for advanced strategies to harmonize integrative accuracy with biological authenticity. This paper proposes a multi-stage deep feature fusion method with multi-kernel omics manifold preservation and a guided gating optimization strategy, and designs the scMAG algorithm. scMAG is a framework aimed at improving the clustering accuracy and data visualization of single-cell multi-omics while achieving adaptive alignment of multi-omics latent spaces and optimizing the distribution of omics data in the ambient space, effectively suppressing biological noise and measurement errors. To comprehensively evaluate the scMAG algorithm, we used paired scRNA-seq with scATAC-seq and scRNA-seq with ADT datasets as benchmarks. We consistently observed that scMAG outperformed other algorithms in terms of clustering quality and data visualization clarity. Further multi-task experimental analysis indicates that this improvement stems from scMAG's ability to improve the distribution of latent features in the data space and to adaptively balance shared signals and modality-specific signals across multi-omics. scMAG not only significantly improved clustering performance but also performed well in feature dimensionality reduction, batch effect removal, multimodal data integration, and cell trajectory inference, and showed good biological interpretability. This method provides new theoretical support and a practical reference for in-depth analysis of multimodal single-cell data.
The final grouping divides research on the spatial interpretability of multimodal data into eight dimensions. The core research paths show an evolution from low-level "geometric manifolds and latent-space alignment," through mid-level "spatial pattern extraction in vertical domains (medicine, remote sensing, perception)," to high-level "evaluation of spatial reasoning in large models and generation control." The research focus has shifted from simple multimodal feature fusion toward mechanistic interpretation of models' internal spatial processing (e.g., SAE applications), and toward maintaining spatio-temporal logical consistency and robustness in physical-interaction scenarios such as embodied intelligence and autonomous driving.