U-Net Saliency Detection
U-Net saliency detection methods based on architectural improvements
These works focus on directly optimizing the U-Net backbone, integrating attention mechanisms, multi-scale modules, and novel architectural components to strengthen salient-feature extraction and semantic understanding.
- Salient Region Detection in Images Based on U-Net and Deep Learning(K. Kumar, M. Marimuthu, Ayan Das Gupta, Bhasker Pant, Surendra Kumar Shukla, Dhiraj Kapila, 2022, 2022 Second International Conference on Advanced Technologies in Intelligent Control, Environment, Computing & Communication Engineering (ICATIECE))
- Att-U2Net: Using Attention to Enhance Semantic Representation for Salient Object Detection(Chenzhe Jiang, Banglian Xu, Qinghe Zheng, Zhengtao Li, Leihong Zhang, Zimin Shen, Quan Sun, Dawei Zhang, 2024, IET Signal Processing)
- Salient object detection via multi-scale attention CNN(Yuzhu Ji, Haijun Zhang, Q. M. J. Wu, 2018, Neurocomputing)
- SA-UNet: A Saliency-Aware Segmentation Network for Waterlogging Disaster Identification in Urban Rail Transit Systems(Jiaying Fan, Xinbo Jiang, Changyuan Chen, Yang Li, Shilun Ma, Jiakai Tian, Rui Guo, 2025, 2025 IEEE 6th International Conference on Computer, Big Data, Artificial Intelligence (ICCBD+AI))
- CMAD-UNet: UNet-Driven RGB-D Salient Object Detection with Cross-Modal Consistency and Aggregative Decoding(Qi Xu, Zhaozhao Su, Zhaoru Guo, Yongming Li, Liejun Wang, Panpan Zheng, 2025, Proceedings of the 2025 International Conference on Multimedia Retrieval)
- Multiscale Cascaded Attention Network for Saliency Detection Based on ResNet(Muwei Jian, Haodong Jin, Xiangyu Liu, Linsong Zhang, 2022, Sensors)
- Hierarchical U-Shape Attention Network for Salient Object Detection(Sanping Zhou, Jinjun Wang, Jimuyang Zhang, Le Wang, Dong Huang, S. Du, N. Zheng, 2020, IEEE Transactions on Image Processing)
- An Improved UNet Algorithm Based on Multiscale Features and Attention Modules for Underwater Salient Multi-Target Detection(Haibin Han, Xinyi Zhou, Haodi Zhu, Yueyi Qiao, Shaojian Yang, Yan Wei, Fengzhong Qu, 2025, OCEANS 2025 Brest)
Edge-aware and structure-guided refinement
These methods target the boundary-blur problem in saliency detection, introducing edge-prediction constraints, reverse attention mechanisms, and structured feature decomposition to achieve high-quality edge preservation and complete object segmentation.
- BASNet: Boundary-Aware Salient Object Detection(Xuebin Qin, Zichen Zhang, Chenyang Huang, Chao Gao, Masood Dehghan, Martin Jägersand, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Revise-Net: Exploiting Reverse Attention Mechanism for Salient Object Detection(Rukhshanda Hussain, Yash Karbhari, Muhammad Fazal Ijaz, M. Woźniak, P. Singh, R. Sarkar, 2021, Remote Sensing)
- Multi-scale feature aggregation and boundary awareness network for salient object detection(Qin Wu, Jianzhe Wang, Zhilei Chai, Guodong Guo, 2022, Image and Vision Computing)
- Disentangled High Quality Salient Object Detection(Lv Tang, Bo Li, Shouhong Ding, Mofei Song, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))
- Decomposition and Completion Network for Salient Object Detection(Zhe Wu, Li Su, Qingming Huang, 2021, IEEE Transactions on Image Processing)
- SODU2-NET: a novel deep learning-based approach for salient object detection utilizing U-NET(Hyder Abbas, Sheng Ren, M. Asim, Syeda Iqra Hassan, A. El-latif, 2025, PeerJ Computer Science)
- EGNet: Edge Guidance Network for Salient Object Detection(Jiaxing Zhao, Jiangjiang Liu, Deng-Ping Fan, Yang Cao, Jufeng Yang, Ming-Ming Cheng, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV))
- Convolutional Edge Constraint-Based U-Net for Salient Object Detection(L. Han, Xuelong Li, Yongsheng Dong, 2019, IEEE Access)
- Stacked Cross Refinement Network for Edge-Aware Salient Object Detection(Zhe Wu, Li Su, Qingming Huang, 2019, 2019 IEEE/CVF International Conference on Computer Vision (ICCV))
RGB-D and multi-modal saliency fusion
These approaches use depth, thermal, or semantic information to complement RGB images, relying on cross-modal interaction and complementary feature enhancement to improve detection accuracy in complex scenes.
- Pushing the Boundaries of Salient Object Detection: A Denoising-Driven Approach(Mengke Song, Luming Li, Xu Yu, Chenglizhao Chen, 2025, IEEE Transactions on Image Processing)
- TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network(Zhengyi Liu, Yuan Wang, Zhengzheng Tu, Yun Xiao, Bin Tang, 2021, Proceedings of the 29th ACM International Conference on Multimedia)
- Hybrid-Attention Network for RGB-D Salient Object Detection(Yuzhen Chen, Wujie Zhou, 2020, Applied Sciences)
- RGB-D Salient Object Detection via 3D Convolutional Neural Networks(Qian Chen, Ze Liu, Y. Zhang, Keren Fu, Qijun Zhao, H. Du, 2021, Proceedings of the AAAI Conference on Artificial Intelligence)
- Real-Time One-Stream Semantic-Guided Refinement Network for RGB-Thermal Salient Object Detection(Fushuo Huo, Xuegui Zhu, Q. Zhang, Ziming Liu, Wenchao Yu, 2022, IEEE Transactions on Instrumentation and Measurement)
- 3-D Convolutional Neural Networks for RGB-D Salient Object Detection and Beyond(Qian Chen, Zhenxi Zhang, Yanye Lu, Keren Fu, Qijun Zhao, 2022, IEEE Transactions on Neural Networks and Learning Systems)
- Deep RGB-D Saliency Detection Without Depth(Yuan-fang Zhang, Jiangbin Zheng, W. Jia, Wenfeng Huang, Long Li, Nian Liu, Fei Li, Xiangjian He, 2021, IEEE Transactions on Multimedia)
- Select, Supplement and Focus for RGB-D Saliency Detection(Miao Zhang, Weisong Ren, Yongri Piao, Zhengkun Rong, Huchuan Lu, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- UMINet: a unified multi-modality interaction network for RGB-D and RGB-T salient object detection(Lina Gao, P. Fu, Mingzhu Xu, Tiantian Wang, Bing Liu, 2023, The Visual Computer)
- Calibrated RGB-D Salient Object Detection(Wei Ji, Jingjing Li, Shuang Yu, Miao Zhang, Yongri Piao, S. Yao, Qi Bi, Kai Ma, Yefeng Zheng, Huchuan Lu, Li Cheng, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Specificity-preserving RGB-D saliency detection(Tao Zhou, Huazhu Fu, Geng Chen, Yi Zhou, Deng-Ping Fan, Ling Shao, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))
- CATNet: A Cascaded and Aggregated Transformer Network for RGB-D Salient Object Detection(Fuming Sun, Pengfei Ren, Bo Yin, Fasheng Wang, Haojie Li, 2024, IEEE Transactions on Multimedia)
- Advancing in RGB-D Salient Object Detection: A Survey(Ai Chen, Xin Li, Tianxiang He, Junlin Zhou, Du Chen, 2024, Applied Sciences)
- Delving into Calibrated Depth for Accurate RGB-D Salient Object Detection(Jingjing Li, Wei Ji, Miao Zhang, Yongri Piao, Huchuan Lu, Li Cheng, 2022, International Journal of Computer Vision)
Lightweight network design and real-time saliency inference
These works target inference efficiency, using parameter reduction, efficient pooling modules, and lightweight backbone designs to achieve real-time saliency prediction on edge devices.
- LARNet: Towards Lightweight, Accurate and Real-Time Salient Object Detection(Zhenyu Wang, Yunzhou Zhang, Yan Liu, Cao Qin, Sonya A. Coleman, D. Kerr, 2024, IEEE Transactions on Multimedia)
- A Simple Pooling-Based Design for Real-Time Salient Object Detection(Jiangjiang Liu, Qibin Hou, Ming-Ming Cheng, Jiashi Feng, Jianmin Jiang, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- FasterSal: Robust and Real-Time Single-Stream Architecture for RGB-D Salient Object Detection(Jin Zhang, Ruiheng Zhang, Lixin Xu, Xiankai Lu, Yushu Yu, Min Xu, He Zhao, 2025, IEEE Transactions on Multimedia)
- ELWNet: An Extremely Lightweight Approach for Real-Time Salient Object Detection(Zhenyu Wang, Yunzhou Zhang, Yan Liu, Delong Zhu, Sonya A. Coleman, D. Kerr, 2023, IEEE Transactions on Circuits and Systems for Video Technology)
- MEANet: Multi-modal edge-aware network for light field salient object detection(Yao Jiang, Wenbo Zhang, Keren Fu, Qijun Zhao, 2022, Neurocomputing)
- Meanet: An Effective and Lightweight Solution for Salient Object Detection in Optical Remote Sensing Images(Bocheng Liang, Huilan Luo, 2023, Expert Systems with Applications)
Global perception with Transformers and sequence modeling
These methods exploit the long-range dependency modeling of Transformers to replace or augment conventional convolutional layers, improving global perception of salient objects and their separation from the background.
- UNETRSal: Saliency Prediction with Hybrid Transformer-Based Architecture(Azamat Kaibaldiyev, Jérémie Pantin, Alexis Lechervy, Fabrice Maurel, Youssef Chahir, Gael Dias, 2025, Lecture Notes in Computer Science)
- TranSalNet: Towards perceptually relevant visual saliency prediction(Jianxun Lou, Hanhe Lin, David Marshall, D. Saupe, Hantao Liu, 2021, Neurocomputing)
- Visual Saliency Transformer(Nian Liu, Ni Zhang, Kaiyuan Wan, Junwei Han, Ling Shao, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))
Video saliency and joint spatiotemporal modeling
Targeting the dynamic nature of video sequences, these approaches capture motion saliency through spatiotemporal correlation modeling while improving the computational efficiency of video processing.
- Salient Object Detection by Spatiotemporal and Semantic Features in Real-Time Video Processing Systems(Yuming Fang, Guanqun Ding, Wenying Wen, Feiniu Yuan, Yong Yang, Zhijun Fang, Weisi Lin, 2020, IEEE Transactions on Industrial Electronics)
- Exploring Rich and Efficient Spatial Temporal Interactions for Real-Time Video Salient Object Detection(Chenglizhao Chen, Guotao Wang, Chong Peng, Yuming Fang, Dingwen Zhang, Hong Qin, 2020, IEEE Transactions on Image Processing)
- Transformer-Based Multi-Scale Feature Integration Network for Video Saliency Prediction(Xiaofei Zhou, Songhe Wu, Ran Shi, Bolun Zheng, Shuai Wang, Haibing Yin, Jiyong Zhang, C. Yan, 2023, IEEE Transactions on Circuits and Systems for Video Technology)
- Real-time Surveillance Video Salient Object Detection Using Collaborative Cloud-Edge Deep Reinforcement Learning(Biao Hou, Junxing Zhang, 2021, 2021 International Joint Conference on Neural Networks (IJCNN))
- Improved salient object detection using hybrid Convolution Recurrent Neural Network(N. V. Kousik, Yuvaraj Natarajan, R. Raja, Suresh Kallam, Rizwan Patan, Amir H. Gandomi, 2021, Expert Systems with Applications)
- Multi-Scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction(Yunzuo Zhang, Tian Zhang, Cunyu Wu, Ran Tao, 2024, IEEE Transactions on Multimedia)
General saliency modeling and multi-scale feature aggregation
This category covers multi-scale context extraction, distraction-resistant designs, and foundational studies, providing broad methodological support for saliency detection.
- Supervision by Fusion: Towards Unsupervised Learning of Deep Salient Object Detector(Dingwen Zhang, Junwei Han, Yu Zhang, 2017, 2017 IEEE International Conference on Computer Vision (ICCV))
- Deep Salient Object Detection With Dense Connections and Distraction Diagnosis(Huaxin Xiao, Jiashi Feng, Yunchao Wei, Maojun Zhang, Shuicheng Yan, 2018, IEEE Transactions on Multimedia)
- Visual saliency based on multiscale deep features(Guanbin Li, Yizhou Yu, 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR))
- Superpixel attention guided network for accurate and real-time salient object detection(Zhiheng Zhou, Yongfan Guo, Junchu Huang, Ming Dai, Ming Deng, Qingjun Yu, 2022, Multimedia Tools and Applications)
- UDNet: Uncertainty-aware deep network for salient object detection(Yuming Fang, Haiyan Zhang, Jiebin Yan, Wenhui Jiang, Yang Liu, 2022, Pattern Recognition)
- The Prediction of Saliency Map for Head and Eye Movements in 360 Degree Images(Yucheng Zhu, Guangtao Zhai, Xiongkuo Min, Jiantao Zhou, 2020, IEEE Transactions on Multimedia)
- Salient Object Detection with Dynamic Convolutions(Rohit Venkata Sai Dulam, Chandra Kambhamettu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- DHSNet: Deep Hierarchical Saliency Network for Salient Object Detection(Nian Liu, Junwei Han, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR))
- Saliency Unified: A Deep Architecture for simultaneous Eye Fixation Prediction and Salient Object Segmentation(S. Kruthiventi, Vennela Gudisa, Jaley H. Dholakiya, R. Venkatesh Babu, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR))
- Multi-Scale Interactive Network for Salient Object Detection(Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Deep Salient Object Detection With Contextual Information Guidance(Yi Liu, Jungong Han, Qiang Zhang, Caifeng Shan, 2020, IEEE Transactions on Image Processing)
- Multi-Scale Cascade Network for Salient Object Detection(X. Li, F. Yang, Hong Cheng, Junyu Chen, Yuxiao Guo, Leiting Chen, 2017, Proceedings of the 25th ACM international conference on Multimedia)
- Salient Object Detection Techniques in Computer Vision—A Survey(A. Gupta, Ayan Seal, M. Prasad, P. Khanna, 2020, Entropy)
- Shallow and Deep Convolutional Networks for Saliency Prediction(Junting Pan, E. Sayrol, Xavier Giro-i-Nieto, Kevin McGuinness, N. O’Connor, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR))
- Visual saliency detection via combining center prior and U-Net(Xiangwei Lu, Muwei Jian, Xing Wang, Hui Yu, Junyu Dong, K. Lam, 2022, Multimedia Systems)
This survey traces the broad application of U-Net and its variants to saliency detection. Research has expanded from basic architectural evolution toward multi-modal fusion, high-fidelity edge preservation, Transformer-based global modeling, real-time lightweight design, and spatiotemporal analysis of video. These directions complement one another and jointly push saliency detection toward industrial-grade applications with stronger robustness, finer detail recovery, and better real-time performance.
A total of 61 related publications are covered.
Salient object detection is receiving increasing attention from researchers, and an accurate saliency map is useful for subsequent tasks. However, in most saliency maps predicted by existing models, object regions are blurred and object edges are irregular. The reason is that traditional methods predict salient objects mainly from hand-crafted features, so pixels belonging to the same object are often assigned different saliency scores. In addition, convolutional neural network (CNN)-based models predict saliency maps at the patch scale, which makes object edges in the output fuzzy. In this paper, we add an edge convolution constraint to a modified U-Net to predict the saliency map of an image. The adopted network structure fuses features from different layers to reduce information loss. Our SalNet predicts the saliency map pixel by pixel rather than at the patch scale as CNN-based models do. Moreover, to better guide the network in mining object edge information, we design a new loss function based on image convolution, which places an L1 constraint on the edge information of the saliency map and the ground truth. Finally, experimental results show that SalNet is effective for salient object detection and is competitive with 11 state-of-the-art models.
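The edge constraint described above can be sketched in a few lines of PyTorch: both the predicted map and the ground truth are convolved with a fixed edge-extraction kernel, and an L1 penalty is placed on the difference. The Laplacian-style kernel and the helper name `edge_l1_loss` are illustrative choices, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def edge_l1_loss(pred, gt):
    """L1 distance between edge maps of the predicted and ground-truth saliency.

    pred, gt: (B, 1, H, W) tensors with values in [0, 1].
    """
    # Fixed Laplacian-style kernel used as the "image convolution" edge extractor.
    kernel = torch.tensor([[-1., -1., -1.],
                           [-1.,  8., -1.],
                           [-1., -1., -1.]],
                          device=pred.device, dtype=pred.dtype).view(1, 1, 3, 3)
    pred_edge = F.conv2d(pred, kernel, padding=1)
    gt_edge = F.conv2d(gt, kernel, padding=1)
    return F.l1_loss(pred_edge, gt_edge)
```

In practice such a term would be added to a standard pixel-wise loss (e.g. binary cross-entropy) with a weighting factor chosen on a validation set.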
Deep Convolutional Neural Networks have been adopted for salient object detection and achieved the state-of-the-art performance. Most of the previous works however focus on region accuracy but not on the boundary quality. In this paper, we propose a predict-refine architecture, BASNet, and a new hybrid loss for Boundary-Aware Salient object detection. Specifically, the architecture is composed of a densely supervised Encoder-Decoder network and a residual refinement module, which are respectively in charge of saliency prediction and saliency map refinement. The hybrid loss guides the network to learn the transformation between the input image and the ground truth in a three-level hierarchy -- pixel-, patch- and map- level -- by fusing Binary Cross Entropy (BCE), Structural SIMilarity (SSIM) and Intersection-over-Union (IoU) losses. Equipped with the hybrid loss, the proposed predict-refine architecture is able to effectively segment the salient object regions and accurately predict the fine structures with clear boundaries. Experimental results on six public datasets show that our method outperforms the state-of-the-art methods both in terms of regional and boundary evaluation measures. Our method runs at over 25 fps on a single GPU. The code is available at: https://github.com/NathanUA/BASNet.
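A minimal sketch of such a three-level hybrid loss is shown below, assuming the prediction has already passed through a sigmoid. The SSIM term uses simplified average-pooling windows rather than the Gaussian-weighted windows of the reference implementation, and the equal weighting of the three terms is an assumption.

```python
import torch
import torch.nn.functional as F

def iou_loss(pred, gt, eps=1e-6):
    # Map-level term: 1 - soft intersection-over-union per sample.
    inter = (pred * gt).sum(dim=(1, 2, 3))
    union = (pred + gt - pred * gt).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()

def ssim_loss(pred, gt, window=11, eps=1e-6):
    # Patch-level term: structural similarity computed over local windows
    # approximated with average pooling (simplified, not Gaussian-weighted).
    pad = window // 2
    mu_p = F.avg_pool2d(pred, window, 1, pad)
    mu_g = F.avg_pool2d(gt, window, 1, pad)
    var_p = F.avg_pool2d(pred * pred, window, 1, pad) - mu_p ** 2
    var_g = F.avg_pool2d(gt * gt, window, 1, pad) - mu_g ** 2
    cov = F.avg_pool2d(pred * gt, window, 1, pad) - mu_p * mu_g
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_g + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_g ** 2 + c1) * (var_p + var_g + c2) + eps)
    return (1.0 - ssim).mean()

def hybrid_loss(pred, gt):
    # Pixel-level BCE + patch-level SSIM + map-level IoU, equally weighted here.
    return F.binary_cross_entropy(pred, gt) + ssim_loss(pred, gt) + iou_loss(pred, gt)
```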
Detecting and segmenting salient objects from natural scenes, often referred to as salient object detection, has attracted great interest in computer vision, and addressing the challenge posed by complex backgrounds is crucial for advancing the field. This article proposes a novel deep learning architecture called SODU2-NET (Salient Object Detection U2-Net), built on the U-NET base structure. The model addresses a gap in previous work on complex backgrounds by employing background-subtraction techniques and a densely supervised encoder-decoder network that can discern relevant foreground information. First, an enriched encoder block combines full feature fusion (FFF) with atrous spatial pyramid pooling (ASPP) at varying dilation rates to efficiently capture multi-scale contextual information, improving detection against complex backgrounds and reducing information loss during down-sampling. Second, an attention module that refines the decoder is constructed to enhance detection in complex backgrounds by selectively focusing on relevant features, allowing the model to reconstruct detailed and contextually relevant information that is essential for accurately determining salient objects. Finally, a residual block is added at the encoder end, responsible for both saliency prediction and map refinement. The network is designed to learn the transformation between input images and ground truth, enabling accurate segmentation of salient object regions with clear borders and accurate prediction of fine structures. SODU2-NET shows superior performance on five public datasets (DUTS, SOD, DUT-OMRON, HKU-IS, PASCAL-S) and a new real-world dataset, the Changsha dataset. In a comparative assessment against FCN, SqueezeNet, DeepLab, and Mask R-CNN, the proposed SODU2-NET improves precision by 6%, recall by 5%, and accuracy by 3%. Overall, the approach shows promise for improving the accuracy and efficiency of salient object detection in a variety of settings.
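The ASPP component referenced above can be illustrated with a small PyTorch module that runs parallel dilated convolutions and fuses their outputs. The dilation rates `(1, 6, 12, 18)` and the channel handling are assumptions; the FFF and attention components of SODU2-NET are not reproduced here.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel 3x3 convolutions with different
    dilation rates capture multi-scale context without losing resolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for r in rates])
        # 1x1 projection back to out_ch after concatenating all branches.
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```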
Salient object detection aims at locating the most conspicuous objects in natural images, which usually acts as a very important pre-processing procedure in many computer vision tasks. In this paper, we propose a simple yet effective Hierarchical U-shape Attention Network (HUAN) to learn a robust mapping function for salient object detection. Firstly, a novel attention mechanism is formulated to improve the well-known U-shape network, in which the memory consumption can be extensively reduced and the mask quality can be significantly improved by the resulting U-shape Attention Network (UAN). Secondly, a novel hierarchical structure is constructed to well bridge the low-level and high-level feature representations between different UANs, in which both the intra-network and inter-network connections are considered to explore the salient patterns from a local to global view. Thirdly, a novel Mask Fusion Network (MFN) is designed to fuse the intermediate prediction results, so as to generate a salient mask which is in higher-quality than any of those inputs. Our HUAN can be trained together with any backbone network in an end-to-end manner, and high-quality masks can be finally learned to represent the salient objects. Extensive experimental results on several benchmark datasets show that our method significantly outperforms most of the state-of-the-art approaches.
DHSNet adopts the whole image as the computational unit and propagates global context information to local contexts hierarchically and progressively.
Recently, fully convolutional networks (FCNs) have made great progress in the task of salient object detection and existing state-of-the-arts methods mainly focus on how to integrate edge information in deep aggregation models. In this paper, we propose a novel Decomposition and Completion Network (DCN), which integrates edge and skeleton as complementary information and models the integrity of salient objects in two stages. In the decomposition network, we propose a cross multi-branch decoder, which iteratively takes advantage of cross-task aggregation and cross-layer aggregation to integrate multi-level multi-task features and predict saliency, edge, and skeleton maps simultaneously. In the completion network, edge and skeleton maps are further utilized to fill flaws and suppress noises in saliency maps via hierarchical structure-aware feature learning and multi-scale feature completion. Through jointly learning with edge and skeleton information for localizing boundaries and interiors of salient objects respectively, the proposed network generates precise saliency maps with uniformly and completely segmented salient objects. Experiments conducted on five benchmark datasets demonstrate that the proposed model outperforms existing networks. Furthermore, we extend the proposed model to the task of RGB-D salient object detection, and it also achieves state-of-the-art performance. The code is available at https://github.com/wuzhe71/DCN.
To overcome these challenges, the authors propose CMAD-UNet, a UNet-based RGB-D salient object detection model presented as an architectural redesign built upon prior work.
We have witnessed a growing interest in video salient object detection (VSOD) techniques in today’s computer vision applications. In contrast with temporal information (which is still considered a rather unstable source thus far), the spatial information is more stable and ubiquitous, thus it could influence our vision system more. As a result, the current main-stream VSOD approaches have inferred and obtained their saliency primarily from the spatial perspective, still treating temporal information as subordinate. Although the aforementioned methodology of focusing on the spatial aspect is effective in achieving a numeric performance gain, it still has two critical limitations. First, to ensure the dominance by the spatial information, its temporal counterpart remains inadequately used, though in some complex video scenes, the temporal information may represent the only reliable data source, which is critical to derive the correct VSOD. Second, both spatial and temporal saliency cues are often computed independently in advance and then integrated later on, while the interactions between them are omitted completely, resulting in saliency cues with limited quality. To combat these challenges, this paper advocates a novel spatiotemporal network, where the key innovation is the design of its temporal unit. Compared with other existing competitors (e.g., convLSTM), the proposed temporal unit exhibits an extremely lightweight design that does not degrade its strong ability to sense temporal information. Furthermore, it fully enables the computation of temporal saliency cues that interact with their spatial counterparts, ultimately boosting the overall VSOD performance and realizing its full potential towards mutual performance improvement for each. The proposed method is easy to implement yet still effective, achieving high-quality VSOD at 50 FPS in real-time applications.
The human visual system can rapidly focus on prominent objects in complex scenes, significantly enhancing information processing efficiency. Salient object detection (SOD) mimics this biological ability, aiming to identify and segment the most prominent regions or objects in images or videos. This reduces the amount of data needed to process while enhancing the accuracy and efficiency of information extraction. In recent years, SOD has made significant progress in many areas such as deep learning, multi-modal fusion, and attention mechanisms. Additionally, it has expanded in real-time detection, weakly supervised learning, and cross-domain applications. Depth images can provide three-dimensional structural information of a scene, aiding in a more accurate understanding of object shapes and distances. In SOD tasks, depth images enhance detection accuracy and robustness by providing additional geometric information. This additional information is particularly crucial in complex scenes and occlusion situations. This survey reviews the substantial advancements in the field of RGB-Depth SOD, with a focus on the critical roles played by attention mechanisms and cross-modal fusion methods. It summarizes the existing literature, provides a brief overview of mainstream datasets and evaluation metrics, and quantitatively compares the discussed models.
Detection and localization of regions of images that attract immediate human visual attention is currently an intensive area of research in computer vision. The capability of automatic identification and segmentation of such salient image regions has immediate consequences for applications in the field of computer vision, computer graphics, and multimedia. A large number of salient object detection (SOD) methods have been devised to effectively mimic the capability of the human visual system to detect the salient regions in images. These methods can be broadly categorized into two categories based on their feature engineering mechanism: conventional or deep learning-based. In this survey, most of the influential advances in image-based SOD from both conventional as well as deep learning-based categories have been reviewed in detail. Relevant saliency modeling trends with key issues, core techniques, and the scope for future research work have been discussed in the context of difficulties often faced in salient object detection. Results are presented for various challenging cases for some large-scale public datasets. Different metrics considered for assessment of the performance of state-of-the-art salient object detection models are also covered. Some future directions for SOD are presented towards end.
Recently, deep learning-based methods, especially utilizing fully convolutional neural networks, have shown extraordinary performance in salient object detection. Despite its success, the clean boundary detection of the saliency objects is still a challenging task. Most of the contemporary methods focus on exclusive edge detection modules in order to avoid noisy boundaries. In this work, we propose leveraging on the extraction of finer semantic features from multiple encoding layers and attentively re-utilize it in the generation of the final segmentation result. The proposed Revise-Net model is divided into three parts: (a) the prediction module, (b) a residual enhancement module, and (c) reverse attention modules. Firstly, we generate the coarse saliency map through the prediction modules, which are fine-tuned in the enhancement module. Finally, multiple reverse attention modules at varying scales are cascaded between the two networks to guide the prediction module by employing the intermediate segmentation maps generated at each downsampling level of the REM. Our method efficiently classifies the boundary pixels using a combination of binary cross-entropy, similarity index, and intersection over union losses at the pixel, patch, and map levels, thereby effectively segmenting the saliency objects in an image. In comparison with several state-of-the-art frameworks, our proposed Revise-Net model outperforms them with a significant margin on three publicly available datasets, DUTS-TE, ECSSD, and HKU-IS, both on regional and boundary estimation measures.
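The reverse-attention step can be sketched as weighting encoder features by the complement of an upsampled coarse prediction, so the refinement branch concentrates on regions the coarse map missed. The module layout below is a generic sketch under those assumptions, not Revise-Net's exact reverse attention module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReverseAttention(nn.Module):
    """Refines a coarse saliency logit using features weighted by its complement."""
    def __init__(self, in_ch, mid_ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, 3, padding=1))

    def forward(self, feat, coarse_pred):
        # coarse_pred: (B, 1, h, w) logits from the prediction module.
        coarse = F.interpolate(coarse_pred, size=feat.shape[2:], mode='bilinear',
                               align_corners=False)
        reverse = 1.0 - torch.sigmoid(coarse)   # emphasize regions the coarse map missed
        residual = self.conv(feat * reverse)    # learn a correction from those regions
        return residual + coarse                # refined logits
```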
Fully convolutional neural networks (FCNs) have shown their advantages in the salient object detection task. However, most existing FCN-based methods still suffer from coarse object boundaries. In this paper, to solve this problem, we focus on the complementarity between salient edge information and salient object information. Accordingly, we present an edge guidance network (EGNet) for salient object detection with three steps to simultaneously model these two kinds of complementary information in a single network. In the first step, we extract the salient object features in a progressive fusion manner. In the second step, we integrate the local edge information and global location information to obtain the salient edge features. Finally, to sufficiently leverage these complementary features, we couple the same salient edge features with salient object features at various resolutions. Benefiting from the rich edge information and location information in salient edge features, the fused features can help locate salient objects, especially their boundaries, more accurately. Experimental results demonstrate that the proposed method performs favorably against the state-of-the-art methods on six widely used datasets without any pre-processing and post-processing. The source code is available at http://mmcheng.net/egnet/.
Salient object detection is a fundamental computer vision task. The majority of existing algorithms focus on aggregating multi-level features of pre-trained convolutional neural networks. Moreover, some researchers attempt to utilize edge information for auxiliary training. However, existing edge-aware models design unidirectional frameworks which only use edge features to improve the segmentation features. Motivated by the logical interrelations between binary segmentation and edge maps, we propose a novel Stacked Cross Refinement Network (SCRN) for salient object detection in this paper. Our framework aims to simultaneously refine multi-level features of salient object detection and edge detection by stacking Cross Refinement Unit (CRU). According to the logical interrelations, the CRU designs two direction-specific integration operations, and bidirectionally passes messages between the two tasks. Incorporating the refined edge-preserving features with the typical U-Net, our model detects salient objects accurately. Extensive experiments conducted on six benchmark datasets demonstrate that our method outperforms existing state-of-the-art algorithms in both accuracy and efficiency. Besides, the attribute-based performance on the SOC dataset show that the proposed model ranks first in the majority of challenging scenes. Code can be found at https://github.com/wuzhe71/SCAN.
Human eye fixations often correlate with locations of salient objects in the scene. However, only a handful of approaches have attempted to simultaneously address the related aspects of eye fixations and object saliency. In this work, we propose a deep convolutional neural network (CNN) capable of predicting eye fixations and segmenting salient objects in a unified framework. We design the initial network layers, shared between both the tasks, such that they capture the object level semantics and the global contextual aspects of saliency, while the deeper layers of the network address task specific aspects. In addition, our network captures saliency at multiple scales via inception-style convolution blocks. Our network shows a significant improvement over the current state-of-the-art for both eye fixation prediction and salient object segmentation across a number of challenging datasets.
Convolutional neural networks (CNNs) have significantly advanced computational modelling for saliency prediction. However, accurately simulating the mechanisms of visual attention in the human cortex remains an academic challenge. It is critical to integrate properties of human vision into the design of CNN architectures, leading to perceptually more relevant saliency prediction. Due to the inherent inductive biases of CNN architectures, there is a lack of sufficient long-range contextual encoding capacity. This hinders CNN-based saliency models from capturing properties that emulate the viewing behaviour of humans. Transformers have shown great potential in encoding long-range information by leveraging the self-attention mechanism. In this paper, we propose a novel saliency model that integrates transformer components into CNNs to capture long-range contextual visual information. Experimental results show that the transformers provide added value to saliency prediction, enhancing its perceptual relevance. Our proposed saliency model using transformers has achieved superior results on public benchmarks and competitions for saliency prediction models. The source code of our proposed saliency model TranSalNet is available at: https://github.com/LJOVO/TranSalNet
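A common way to realize this CNN-plus-transformer hybrid is to flatten a CNN feature map into tokens and run a few self-attention layers over them. The sketch below follows that generic pattern; the module name `TransformerContext`, the head count, and the layer count are assumptions, not TranSalNet's exact block design. The channel dimension must be divisible by the number of heads.

```python
import torch
import torch.nn as nn

class TransformerContext(nn.Module):
    """Adds long-range context to a CNN feature map via self-attention over its pixels."""
    def __init__(self, ch, heads=8, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(d_model=ch, nhead=heads,
                                                   batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)

    def forward(self, feat):
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)     # (B, H*W, C): one token per location
        tokens = self.encoder(tokens)                # global self-attention
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```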
The model builds on the UNETR architecture, an extension of the UNET, and is trained on the difference between the predicted saliency map P and the ground-truth saliency map G.
The prediction of salient areas in images has traditionally been addressed with hand-crafted features based on neuroscience principles. This paper, however, addresses the problem with a completely data-driven approach by training a convolutional neural network (convnet). The learning process is formulated as a minimization of a loss function that measures the Euclidean distance of the predicted saliency map with the provided ground truth. The recent publication of large datasets of saliency prediction has provided enough data to train end-to-end architectures that are both fast and accurate. Two designs are proposed: a shallow convnet trained from scratch, and another, deeper solution whose first three layers are adapted from another network trained for classification. To the authors' knowledge, these are the first end-to-end CNNs trained and tested for the purpose of saliency prediction.
Subway systems serve as the lifelines of modern cities, making the safe operation of urban rail transit systems critically important. Among various threats, waterlogging disasters—caused by extreme weather, equipment failures, or external pipeline ruptures — are among the most severe risks to subway systems. Flooding can lead to short circuits and paralysis of high-voltage electrical and signal systems, pose significant challenges to large-scale passenger evacuation, and potentially paralyze the entire urban transportation network, resulting in immeasurable losses. Therefore, accurate identification and monitoring of waterlogging disasters are crucial for early warning and emergency response in subway systems. However, existing deep learning models often struggle when applied to real-world subway environments. These settings are characterized by unique challenges, such as frequent train movements, passenger flows, dynamic advertising displays, and complex metallic reflections. These factors can interfere with the model's perception, making it difficult to accurately focus on actual water bodies and leading to missed or false detections. To address these challenges, this paper proposes a novel saliency-aware segmentation network, SA-UNet. The core idea of SA-UNet is to endow the model with the ability to actively perceive "salient" regions within an image. By introducing a dedicated Dynamic Context Attention (DCA) module, the network generates an implicit saliency map at each stage of feature extraction. This mechanism allows the network to concentrate computational resources on the features most relevant to water bodies while suppressing background interference.
Existing state-of-the-art saliency detection methods heavily rely on CNN-based architectures. Alternatively, we rethink this task from a convolution-free sequence-to-sequence perspective and predict saliency by modeling long-range dependencies, which can not be achieved by convolution. Specifically, we develop a novel unified model based on a pure transformer, namely, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD). It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Unlike conventional architectures used in Vision Transformer (ViT), we leverage multi-level token fusion and propose a new token upsampling method under the transformer framework to get high-resolution detection results. We also develop a token-based multi-task decoder to simultaneously perform saliency and boundary detection by introducing task-related tokens and a novel patch-task-attention mechanism. Experimental results show that our model outperforms existing methods on both RGB and RGB-D SOD benchmark datasets. Most importantly, our whole framework not only provides a new perspective for the SOD field but also shows a new paradigm for transformer-based dense prediction models. Code is available at https://github.com/nnizhang/VST.
By recording the whole scene around the capturer, virtual reality (VR) techniques can provide viewers the sense of presence. To provide a satisfactory quality of experience, there should be at least 60 pixels per degree, so the resolution of panoramas should reach 21600 × 10800. The huge amount of data will put great demands on data processing and transmission. However, when exploring in the virtual environment, viewers only perceive the content in the current field of view (FOV). Therefore if we can predict the head and eye movements which are important behaviors of viewer, more processing resources can be allocated to the active FOV. But conventional saliency prediction methods are not fully adequate for panoramic images. In this paper, a new panorama-oriented model, to predict head and eye movements, is proposed. Due to the superiority of computation in the spherical domain, the spherical harmonics are employed to extract features at different frequency bands and orientations. Related low- and high-level features including the rare components in the frequency domain and color domain, the difference between center vision and peripheral vision, visual equilibrium, person and car detection, and equator bias are extracted to estimate the saliency. To predict head movements, visual mechanisms including visual uncertainty and equilibrium are incorporated, and the graphical model and functional representation for the switch of head orientation are established. Extensive experimental results on the publicly available database demonstrate the effectiveness of our methods.
In this paper, a center prior-based saliency detection method is proposed, which combines the intrinsic cue of salient objects with the deep learning framework U-Net.
Salient Object Detection (SOD) aims to identify the most attention-grabbing regions in an image and focuses on distinguishing salient objects from their backgrounds. Current SOD methods primarily use a discriminative approach, which works well for clear images but struggles in complex scenes with similar colors and textures between objects and backgrounds. To address these limitations, we introduce the diffusion-based salient object detection model (DiffSOD), which leverages a noise-to-image denoising process within a diffusion framework, enhancing saliency detection in both RGB and RGB-D images. Unlike conventional fusion-based SOD methods that directly merge RGB and depth information, we treat RGB and depth as distinct conditions, i.e., the appearance condition and the structure condition, respectively. These conditions serve as controls within the diffusion UNet architecture, guiding the denoising process. To facilitate this guidance, we employ two specialized control adapters: the appearance control adapter and the structure control adapter. Moreover, conventional denoising UNet models may struggle when handling low-quality depth maps, potentially introducing detrimental cues into the denoising process. To mitigate the impact of low-quality depth maps, we introduce a quality-aware filter. This filter selectively processes only high-quality depth data, ensuring that the denoising process is based on reliable information. Comparative evaluations on benchmark datasets have shown that DiffSOD substantially surpasses existing RGB and RGB-D saliency detection methods, improving average performance by 1.5% and 1.2% respectively, thus setting a new benchmark for diffusion-based dense prediction models in visual saliency detection.
Salient object detection has been widely used in computer vision tasks such as image understanding, semantic segmentation, and target tracking by mimicking the human visual system to find the most visually conspicuous object. The U2Net model performs well in salient object detection (SOD) thanks to its unique U-shaped residual structure and its U-shaped backbone, which incorporate feature information at different scales. However, in the U-shaped structure, the global semantic information computed at the topmost layer may be gradually diluted by the large amount of local information along the top-down path, and the residual U-block (RSU) pays insufficient attention to features in the salient target region while passing redundant features to the next stage. To address these two shortcomings, this paper improves U2Net in two ways: an attentional gating mechanism is added to filter redundant features in the U-structure backbone, and a channel attention (CA) mechanism is introduced to capture important features in the RSU module. Experimental results show that the proposed method achieves higher accuracy than the U2Net model.
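The two ingredients mentioned above, channel attention inside the RSU block and an attention gate on the U-structure skip connections, can be sketched as follows. Both modules are generic SE-style and additive-gate designs and may differ from Att-U2Net's exact formulation.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """SE-style channel attention: re-weights channels by globally pooled statistics."""
    def __init__(self, ch, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // reduction, ch, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class AttentionGate(nn.Module):
    """Additive attention gate: filters skip features with a deeper gating signal."""
    def __init__(self, skip_ch, gate_ch, inter_ch):
        super().__init__()
        self.w_x = nn.Conv2d(skip_ch, inter_ch, 1)
        self.w_g = nn.Conv2d(gate_ch, inter_ch, 1)
        self.psi = nn.Sequential(nn.ReLU(inplace=True),
                                 nn.Conv2d(inter_ch, 1, 1), nn.Sigmoid())

    def forward(self, skip, gate):
        # gate is assumed to be already resized to skip's spatial size.
        return skip * self.psi(self.w_x(skip) + self.w_g(gate))
```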
Image analysis tasks use salient object detection because it not only identifies the important elements of a visual scene but also reduces computational complexity by discarding unimportant ones. In this research, we propose a salient object detection method based on a deep learning network that preserves mid- and low-level image information. Using a deep learning model, our technique first generates a coarse saliency map for the entire target image; the map is then refined using low- to mid-level information specific to the image. We adopt a U-Net as our architecture, so the saliency map is predicted pixel by pixel, reducing the loss of low-level visual information. Our results show that the system consistently outperforms other salient object detection approaches, yielding superior precision and recall.
Integration of multi-level contextual information, such as feature maps and side outputs, is crucial for Convolutional Neural Networks (CNNs)-based salient object detection. However, most existing methods either simply concatenate multi-level feature maps or calculate element-wise addition of multi-level side outputs, thus failing to take full advantage of them. In this paper, we propose a new strategy for guiding multi-level contextual information integration, where feature maps and side outputs across layers are fully engaged. Specifically, shallower-level feature maps are guided by the deeper-level side outputs to learn more accurate properties of the salient object. In turn, the deeper-level side outputs can be propagated to high-resolution versions with spatial details complemented by means of shallower-level feature maps. Moreover, a group convolution module is proposed with the aim of achieving highly discriminative feature maps, in which the backbone feature maps are divided into a number of groups and then the convolution is applied to the channels of backbone feature maps within each group. Eventually, the group convolution module is incorporated in the guidance module to further promote the guidance role. Experiments on three public benchmark datasets verify the effectiveness and superiority of the proposed method over the state-of-the-art methods.
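A rough sketch of the guidance idea, a deeper side output re-weighting a shallower feature map before a grouped convolution, is given below. The module name, the residual form of the guidance, and the group count are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedGroupConv(nn.Module):
    """Modulates a shallow feature map with a deeper side output, then applies
    a grouped convolution over the guided feature."""
    def __init__(self, ch, groups=4):
        super().__init__()
        # ch must be divisible by groups.
        self.group_conv = nn.Conv2d(ch, ch, 3, padding=1, groups=groups)

    def forward(self, shallow_feat, deep_side_out):
        # deep_side_out: (B, 1, h, w) saliency logits from a deeper layer.
        guide = torch.sigmoid(F.interpolate(deep_side_out, size=shallow_feat.shape[2:],
                                            mode='bilinear', align_corners=False))
        guided = shallow_feat * guide + shallow_feat   # emphasize likely-salient regions
        return self.group_conv(guided)
```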
Aiming at discovering and locating most distinctive objects from visual scenes, salient object detection (SOD) plays an essential role in various computer vision systems. Coming to the era of high resolution, SOD methods are facing new challenges. The major limitation of previous methods is that they try to identify the salient regions and estimate the accurate objects boundaries simultaneously with a single regression task at low-resolution. This practice ignores the inherent difference between the two difficult problems, resulting in poor detection quality. In this paper, we propose a novel deep learning framework for high-resolution SOD task, which disentangles the task into a low-resolution saliency classification network (LRSCN) and a high-resolution refinement network (HRRN). As a pixel-wise classification task, LRSCN is designed to capture sufficient semantics at low-resolution to identify the definite salient, background and uncertain image regions. HRRN is a regression task, which aims at accurately refining the saliency value of pixels in the uncertain region to preserve a clear object boundary at high-resolution with limited GPU memory. It is worth noting that by introducing uncertainty into the training process, our HRRN can well address the high-resolution refinement task without using any high-resolution training data. Extensive experiments on high-resolution saliency datasets as well as some widely used saliency benchmarks show that the proposed method achieves superior performance compared to the state-of-the-art methods.
In this paper, we propose two novel components for improving deep salient object detection models. The first component, called saliency detection network (S-Net), introduces dense short- and long-range connections that effectively integrate multiscale features to better exploit contexts at multiple levels. Benefiting from the direct access to low- and high-level features, the S-Net can not only exploit the object context but also preserve the object boundary sharply, leading to enhanced saliency detection performance. Second, a distraction detection network (D-Net) is developed to learn to diagnose which regions of an input image are distracting and harmful for saliency prediction of the S-Net. With such distraction diagnosis, the regions that are distracting to S-Net are removed in hindsight from the input image and the resulted distraction-free image is fed to S-Net for saliency prediction. To train the D-Net, a distraction mining approach is proposed to localize the model-specific distracting regions through examining the sensitiveness of the S-Net to image regions in a principled manner. Besides, the distraction mining approach also provides a way to interpret decisions made by deep neural network (DNN) saliency detection models, which relieves the black-box issues of DNNs to some extent. Extensive experiments on seven popular benchmark datasets demonstrate the effectiveness of the combined S-Net and D-Net, which provides new state of the arts.
Underwater salient object detection (USOD) has attracted increasing research attention. This work proposes an improved UNet model incorporating multi-scale features and attention modules.
Visual saliency is a fundamental problem in both cognitive and computational sciences, including computer vision. In this paper, we discover that a high-quality visual saliency model can be learned from multiscale features extracted using deep convolutional neural networks (CNNs), which have had many successes in visual recognition tasks. For learning such saliency models, we introduce a neural network architecture, which has fully connected layers on top of CNNs responsible for feature extraction at three different scales. We then propose a refinement method to enhance the spatial coherence of our saliency results. Finally, aggregating multiple saliency maps computed for different levels of image segmentation can further boost the performance, yielding saliency maps better than those generated from a single segmentation. To promote further research and evaluation of visual saliency models, we also construct a new large database of 4447 challenging images and their pixelwise saliency annotations. Experimental results demonstrate that our proposed method is capable of achieving state-of-the-art performance on all public benchmarks, improving the F-Measure by 5.0% and 13.2% respectively on the MSRA-B dataset and our new dataset (HKU-IS), and lowering the mean absolute error by 5.7% and 35.1% respectively on these two datasets.
Recently, video saliency prediction has attracted increasing attention, yet the improvement of its accuracy is still subject to the insufficient use of multi-scale spatiotemporal features. To address this issue, we propose a 3D convolutional Multi-scale Spatiotemporal Feature Fusion Network (MSFF-Net) to achieve the full utilization of spatiotemporal features. Specifically, we propose a Bi-directional Temporal-Spatial Feature Pyramid (BiTSFP), the first application of bi-directional fusion architectures in this field, which adds the flow of shallow location information on the basis of the previous flow of deep semantic information. Then, different from simple addition and concatenation, we design an Attention-Guided Fusion (AGF) mechanism that can adaptively learn the fusion weights of adjacent features to integrate them appropriately. Moreover, a Frame-wise Attention (FA) module is introduced to selectively emphasize the useful frames, augmenting the multi-scale temporal features to be fused. Our model is simple but effective, and it can run in real-time. Experimental results on the DHF1K, Hollywood-2, and UCF-sports datasets demonstrate that the proposed MSFF-Net outperforms existing state-of-the-art methods in accuracy.
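The attention-guided fusion of adjacent features can be sketched as learning per-pixel softmax weights over the two inputs. The paper operates on 3D spatiotemporal features, so the 2D module below only illustrates the adaptive-weighting idea; the module name and layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class AttentionGuidedFusion(nn.Module):
    """Learns per-location weights to blend two adjacent pyramid features."""
    def __init__(self, ch):
        super().__init__()
        self.weight = nn.Sequential(
            nn.Conv2d(2 * ch, 2, 3, padding=1),
            nn.Softmax(dim=1))    # two weights that sum to 1 at every location

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b are assumed to share shape (B, C, H, W) after resampling.
        w = self.weight(torch.cat([feat_a, feat_b], dim=1))
        return w[:, 0:1] * feat_a + w[:, 1:2] * feat_b
```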
Fully convolutional network (FCN)-based semantic segmentation models have largely inspired recent work in salient object detection. However, the lack of context-information summarization can degrade the accuracy of the final saliency map, and the information loss from the down-sampling operations of FCN-based models leads to the loss of details, such as the edges of the salient object. In this paper, we propose a novel deep convolutional neural network (CNN) that introduces a spatial and channel-wise attention layer into a multi-scale encoder-decoder framework. The attention layer aligns the context information between the feature maps at different scales and the final prediction of the saliency map. In addition, a structure with multiple side-way outputs at different scales is designed to produce more accurate edge-preserving saliency maps by integrating saliency maps across scales. Experimental results demonstrate the effectiveness of the proposed model on several benchmark datasets, and additional experiments validate the potential of applying the trained saliency model to other object-driven vision tasks as an efficient preprocessing step.
Most cutting-edge video saliency prediction models rely on spatiotemporal features extracted by 3D convolutions due to its local contextual cues acquirement ability. However, the shortage of 3D convolutions is that it cannot effectively capture long-term spatiotemporal dependencies in videos. To address this limitation, we propose a novel Transformer-based Multi-scale Feature Integration Network (TMFI-Net) for video saliency prediction, where the proposed TMFI-Net consists of a semantic-guided encoder and a hierarchical decoder. Firstly, embarking on the Transformer-based multi-level spatiotemporal features, the semantic-guided encoder enhances the features by inserting the high-level feature into each level feature via a top-down pathway and a longitudinal connection, which endows the multi-level spatiotemporal features with rich contextual information. In this way, the features are steered to give more concerns to saliency regions. Secondly, the hierarchical decoder employs a multi-dimensional attention (MA) module to elevate features along channel, temporal, and spatial dimensions jointly. Successively, the hierarchical decoder deploys a progressive decoding block to conduct an initial saliency prediction, which provides a coarse localization of saliency regions. Lastly, considering the complementarity of different saliency predictions, we integrate all initial saliency prediction results into the final saliency map. Comprehensive experimental results on four video saliency datasets firmly demonstrate that our model achieves superior performance when compared with the state-of-the-art video saliency models. The code is available at https://github.com/wusonghe/TMFI-Net.
Saliency detection is a key research topic in computer vision. Through the visual perception areas of the brain, humans can quickly and accurately lock onto regions of interest in complex and changing scenes. Although existing saliency-detection methods achieve competent performance, they suffer from deficiencies such as unclear margins of salient objects and interference from background information in the saliency map. In this study, to address these defects, a multiscale cascaded attention network was designed based on ResNet34. Unlike the typical U-shaped encoding-decoding architecture, we devised a contextual feature extraction module to enhance high-level semantic feature extraction. Specifically, a multiscale cascade block (MCB) and a lightweight channel attention (CA) module were added between the encoding and decoding networks for optimization. To address the blurred-edge issue, which many previous approaches neglect, we adopted an edge-thinning module to carry out a deeper edge-thinning process on the output-layer image. The experimental results show that this method achieves competitive saliency-detection performance, with improved accuracy and recall compared with other representative methods.
Deep-learning based salient object detection methods achieve great progress. However, the variable scale and unknown category of salient objects are great challenges all the time. These are closely related to the utilization of multi-level and multi-scale features. In this paper, we propose the aggregate interaction modules to integrate the features from adjacent levels, in which less noise is introduced because of only using small up-/down-sampling rates. To obtain more efficient multi-scale features from the integrated features, the self-interaction modules are embedded in each decoder unit. Besides, the class imbalance issue caused by the scale variation weakens the effect of the binary cross entropy loss and results in the spatial inconsistency of the predictions. Therefore, we exploit the consistency-enhanced loss to highlight the fore-/back-ground difference and preserve the intra-class consistency. Experimental results on five benchmark datasets demonstrate that the proposed method without any post-processing performs favorably against 23 state-of-the-art approaches. The source code will be publicly available at https://github.com/lartpang/MINet.
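Reading the consistency-enhanced loss as a region-level term that compares whole foreground masks rather than individual pixels, a Dice-style sketch looks like the following. This is our interpretation of the idea, not necessarily the paper's exact definition.

```python
import torch

def consistency_enhanced_loss(pred, gt, eps=1e-6):
    """Region-level loss: penalizes the mismatch between predicted and ground-truth
    foreground as a whole, so spatially inconsistent predictions are punished even
    when the per-pixel BCE is small.

    pred: probabilities in [0, 1]; gt: binary ground truth; both (B, 1, H, W).
    """
    inter = (pred * gt).sum(dim=(1, 2, 3))
    total = pred.sum(dim=(1, 2, 3)) + gt.sum(dim=(1, 2, 3))
    # Numerator equals FP + FN, since FP + FN = |pred| + |gt| - 2 * |pred ∩ gt|.
    return ((total - 2.0 * inter) / (total + eps)).mean()
```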
Existing lightweight salient object detection (SOD) methods aim to solve the problem of high computational costs that is prevalent with heavyweight methods. However, compared with heavyweight methods, the detection accuracy of lightweight methods is greatly reduced while real-time performance is not significantly improved. Therefore, we aim to establish a trade off between computational cost and detection performance by improving the network efficiency. We propose a fast and extremely lightweight end-to-end wavelet neural network (ELWNet) for real-time salient object detection. ELWNet can achieve salient object detection and segmentation at approximately 70FPS (GPU), 19FPS (CPU) with 76K parameters and 0.38G FLOPs. We introduce wavelet transform theory into a neural network, proposing a wavelet transform module (WTM), a wavelet transform fusion module (WTFM), a novel feature residual mechanism, and construct an efficient architecture. The wavelet transform theory is integrated into the neural network to realize the interaction between the features in the frequency and the time domain. Meanwhile, ELWNet does not rely on a pre-trained model, which significantly reduces redundant features. We validate the performance of ELWNet using five well-known datasets, and demonstrate state-of-the-art performance compared with 24 other SOD models in terms of being lightweight, detection accuracy and real-time capabilities. Our method maintains high detection performance while reducing the number of model parameters by approximately 99% compared with heavyweight methods.
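The wavelet idea can be illustrated with a one-level 2D Haar decomposition, which splits a feature map into a half-resolution low-frequency band and three high-frequency bands. ELWNet's actual WTM/WTFM designs are more involved, so treat this purely as a sketch; it assumes even height and width.

```python
import torch
import torch.nn as nn

class HaarDWT(nn.Module):
    """One-level 2D Haar decomposition into a low-frequency band (LL) and three
    high-frequency bands (LH, HL, HH). Assumes even input height and width."""
    def forward(self, x):
        a = x[:, :, 0::2, 0::2]   # top-left pixel of each 2x2 block
        b = x[:, :, 0::2, 1::2]   # top-right
        c = x[:, :, 1::2, 0::2]   # bottom-left
        d = x[:, :, 1::2, 1::2]   # bottom-right
        ll = (a + b + c + d) / 2.0
        lh = (-a - b + c + d) / 2.0
        hl = (-a + b - c + d) / 2.0
        hh = (a - b - c + d) / 2.0
        return ll, torch.cat([lh, hl, hh], dim=1)
```

The low-frequency band provides a cheap, parameter-free downsampling path, while the high-frequency bands retain edge detail that is useful for sharp saliency boundaries.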
Salient object detection (SOD) has rapidly developed in recent years, and detection performance has greatly improved. However, the price of these improvements is increasingly complex networks that require more computing resources and sacrifice real-time performance. This makes it difficult to deploy these approaches on devices with limited computing resources (such as mobile phones, embedded platforms, etc.). Considering recently developed lightweight SOD models, their detection and real-time performance are always compromised in demanding practical application scenarios. To solve these problems, we propose a novel lightweight SOD method called LARNet and its corresponding extremely lightweight method LARNet$^{*}$ according to application requirements. These methods balance the relationship between lightweight requirements, detection accuracy and real-time performance. First, we propose a saliency backbone network tailored for SOD, which removes the need for pre-training with ImageNet and effectively reduces feature redundancy. Subsequently, we propose a novel context gating module (CGM), which simulates the physiological mechanism of human brain neurons and visual information processing, and realizes the deep fusion of multi-level features at the global level. Finally, the saliency map is output after fusion of multi-level features. Extensive experiments on popular benchmark datasets demonstrate that the proposed LARNet (LARNet$^{*}$) achieves 98 (113) FPS on a GPU and 3 (6) FPS on a CPU. With approximately 680 K (90 K) parameters, the model has significant performance advantages over (extremely) lightweight methods, even surpassing some heavyweight models.
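One plausible reading of the context gating idea above is a global gate that modulates fused multi-level features; the sketch below follows that reading and is not LARNet's published CGM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextGate(nn.Module):
    """Illustrative context-gating style fusion of multi-level features."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),    # global context
                                  nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())

    def forward(self, feats):                       # feats: list of (B, C, Hi, Wi)
        size = feats[0].shape[-2:]                  # fuse at the finest resolution
        aligned = [F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                   for f in feats]
        fused = torch.stack(aligned, dim=0).sum(dim=0)
        return fused * self.gate(fused)             # global gate suppresses irrelevant context
```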
We solve the problem of salient object detection by investigating how to expand the role of pooling in convolutional neural networks. Based on the U-shape architecture, we first build a global guidance module (GGM) upon the bottom-up pathway, aiming at providing layers at different feature levels the location information of potential salient objects. We further design a feature aggregation module (FAM) to make the coarse-level semantic information well fused with the fine-level features from the top-down pathway. By adding FAMs after the fusion operations in the top-down pathway, coarse-level features from the GGM can be seamlessly merged with features at various scales. These two pooling-based modules allow the high-level semantic features to be progressively refined, yielding detail-enriched saliency maps. Experimental results show that our proposed approach can more accurately locate salient objects with sharpened details and hence substantially improve the performance compared with the previous state of the art. Our approach is also fast and can run at a speed of more than 30 FPS when processing a 300×400 image. Code can be found at http://mmcheng.net/poolnet/.
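A hedged sketch of a pooling-based aggregation step in the spirit of the FAM described above: a coarse guidance feature is merged with a decoder feature and smoothed at several pooled scales. The pool sizes and channel widths are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAggregation(nn.Module):
    """Merges a guidance feature into a decoder feature via multi-scale average pooling."""
    def __init__(self, channels: int, pool_sizes=(2, 4, 8)):
        super().__init__()
        self.pool_sizes = pool_sizes
        self.convs = nn.ModuleList(nn.Conv2d(channels, channels, 3, padding=1)
                                   for _ in pool_sizes)
        self.out_conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, decoder_feat, guidance_feat):
        size = decoder_feat.shape[-2:]
        x = decoder_feat + F.interpolate(guidance_feat, size=size,
                                         mode="bilinear", align_corners=False)
        fused = x
        for k, conv in zip(self.pool_sizes, self.convs):
            y = F.avg_pool2d(x, kernel_size=k, stride=k)          # coarse view at scale 1/k
            fused = fused + F.interpolate(conv(y), size=size,
                                          mode="bilinear", align_corners=False)
        return self.out_conv(fused)

if __name__ == "__main__":
    dec, guide = torch.randn(1, 128, 64, 64), torch.randn(1, 128, 8, 8)
    print(FeatureAggregation(128)(dec, guide).shape)              # torch.Size([1, 128, 64, 64])
```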
Salient object detection (SOD) has been widely used in practical applications, such as multisensor image fusion, remote sensing, and defect detection. Recently, SOD from RGB and thermal (T) images has developed rapidly due to its robustness in extreme situations, such as low illumination and occlusion. However, existing methods all utilize a dual-stream encoder, which significantly increases the computational burden and hinders real-world deployment. To this end, we propose a real-time One-stream Semantic-guided Refinement Network (OSRNet) for RGB-T SOD. Specifically, we first fuse the RGB and T inputs via concatenation, addition, and multiplication operations to mine the complementary information between the two modalities. This efficient early fusion not only facilitates information exchange between the modalities but also avoids the cumbersome dual-stream encoder structure. Then, a lightweight decoder is proposed, making the high-level semantic information filter the low-level noisy features and gradually refine the final prediction. Also, we apply deep supervision to make the training procedure more stable and fast. Due to the early fusion strategy, OSRNet can run at a real-time speed (53–60 fps) on a single GPU. Extensive quantitative and qualitative experiments show that our network outperforms 11 state-of-the-art methods in terms of seven evaluation metrics. Our codes have been released at https://github.com/huofushuo/OSRNet.
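The early fusion named above (concatenation, addition, and multiplication of the RGB and thermal inputs) is simple enough to sketch directly; the projection layers, the single-channel thermal input, and the channel width below are assumptions about how such a fusion could be wired, not OSRNet's exact layers.

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    """Fuses RGB and thermal inputs before the encoder (illustrative wiring)."""
    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.rgb_proj = nn.Conv2d(3, out_channels, 3, padding=1)
        self.t_proj = nn.Conv2d(1, out_channels, 3, padding=1)   # assume single-channel thermal
        self.merge = nn.Conv2d(out_channels * 4, out_channels, 1)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        r, t = self.rgb_proj(rgb), self.t_proj(thermal)
        # concatenation of both modalities plus additive and multiplicative cues
        fused = torch.cat([r, t, r + t, r * t], dim=1)
        return self.merge(fused)

if __name__ == "__main__":
    out = EarlyFusion()(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
    print(out.shape)                                              # torch.Size([1, 64, 256, 256])
```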
Object detection is significant for event analysis in various intelligent multimedia processing systems. Although there have been many studies conducting research in this area, effective and efficient object detection methods for video sequences are still much desired. In this article, we investigate salient object detection in real-time multimedia processing systems. Considering the intrinsic relationship between top-down and bottom-up saliency features, we present a new effective method for video salient object detection based on deep semantic and spatiotemporal cues. After extracting top-down semantic features for object perception by a 2-D convolutional network, we concatenate them with bottom-up spatiotemporal cues for motion perception extracted by a 3-D convolutional network. In order to combine these features effectively, we feed them into a 3-D deconvolutional network for feature-sharing learning between semantic features and spatiotemporal cues for the final saliency prediction. Additionally, we propose a novel Gaussian-like loss function with an $L_{2}$-norm regularization term for parameter learning. Experimental results show that the proposed salient object detection approach performs better in terms of both effectiveness and efficiency for video sequences compared with the state-of-the-art models.
… Then, the saliency residual blocks (SalRBs) composed of one residual unit with stride= 2 (ResUnitS2) and two ResUnitS1s are utilized to capture low-resolution features with higher …
In recent years, with the advancement of cloud computing technology and the availability of cheaper hardware, surveillance systems have become more and more common. Unfortunately, most existing systems still face many limitations, such as latency and real-time analysis issues. Edge computing effectively expands the boundaries of cloud computing by migrating some computing and analysis tasks to edge devices for execution; performing video analysis on edge devices may therefore be a good solution. In this paper, we adopt a collaborative Cloud-Edge architecture to analyze surveillance video and extract video keyframes for compressing video data at the edge. Then, we employ a residual U-Net neural network to perform salient object detection on the extracted keyframes. Finally, we utilize the deep reinforcement learning Asynchronous Advantage Actor-Critic (A3C) algorithm to schedule the residual U-Net tasks and adaptively offload them to the cloud or the edge, reducing system latency and improving real-time performance. We verified the system performance using real road surveillance videos and other public datasets. The experimental results are encouraging: they show that real-time processing of surveillance video based on the collaborative cloud-edge mechanism can obtain the optimal result within a tolerable latency range.
RGB-D Salient Object Detection (SOD) aims to segment the most prominent areas and objects in a given pair of RGB and depth images. Most current models adopt a dual-stream structure to extract information from both RGB and depth images. However, this leads to an exponential increase in the number of parameters and computations in the model. Moreover, the discrepancy between the RGB-pretrained encoder and the 3D geometric relationships in depth maps presents a challenge for the encoder in capturing spatial structural details. These issues impact the model's accuracy in locating salient objects and distinguishing edge details. To address these, we propose a novel early feature fusion network, named FasterSal, which enables more efficient RGB-D SOD. FasterSal uses a single-stream structure to receive RGB images and depth maps, extracting features based on the 3D geometric relationships in the depth map while fully leveraging the pretrained RGB encoder. This approach effectively avoids the inconsistencies between the depth modality and the RGB-pretrained encoder. It also significantly reduces the number of network parameters while maintaining efficient feature encoding capabilities. To achieve finer edge learning, a detail-aware loss and a texture enhancement module are introduced. These modules are designed to extract latent details in high-frequency component features and to enhance the edge-learning capability of the model using distance information. Experimental results on several benchmark datasets confirm the effectiveness and superiority of our method over state-of-the-art approaches, achieving a good balance between performance and speed with only 3.4 million parameters and a CPU speed of 63 FPS.
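A generic, hedged illustration of what "a single stream that leverages a pretrained RGB encoder" can mean in practice is to widen the stem convolution of a pretrained backbone to four input channels while keeping the RGB filters. This is not FasterSal's actual mechanism, only a common baseline for feeding an RGB-D pair into one encoder (using torchvision's resnet18 for concreteness).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

def make_rgbd_backbone() -> nn.Module:
    net = resnet18(weights=ResNet18_Weights.DEFAULT)
    old = net.conv1                                   # pretrained 64 x 3 x 7 x 7 stem
    new = nn.Conv2d(4, old.out_channels, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        new.weight[:, :3] = old.weight                               # reuse RGB filters
        new.weight[:, 3:] = old.weight.mean(dim=1, keepdim=True)     # init depth channel
    net.conv1 = new
    return net

if __name__ == "__main__":
    rgbd = torch.cat([torch.randn(1, 3, 224, 224), torch.randn(1, 1, 224, 224)], dim=1)
    print(make_rgbd_backbone()(rgbd).shape)           # ImageNet logits; replace the head for SOD
```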
… Salient object detection in optical remote sensing images (RSI-SOD) aims to segment objects that attract human attention in optical RSIs. With the tremendous success of full …
Convolutional Neural Networks (CNNs) rely on content-independent convolution operations that extract features shared across the entire dataset, limiting their adaptability to individual inputs. In contrast, input-dependent architectures like Vision Transformers (ViTs) can adapt to the specific characteristics of each input. To enhance input adaptability in CNNs, we propose SODDCNet, an encoder-decoder architecture for Salient Object Detection (SOD) that employs large convolutions with dynamically generated weights via the self-attention mechanism. Additionally, unlike other CNN architectures, we utilize multiple large kernels in parallel to segment salient objects of various sizes. To pre-train the proposed model, we combine the COCO and OpenImages semantic segmentation datasets to create a 3.18M image dataset for SOD. Comprehensive quantitative experiments conducted on benchmark datasets demonstrate that SODDCNet performs competitively compared to state-of-the-art methods in SOD and Video SOD. The code and pre-computed saliency maps are provided here.
Salient object detection in video is a critical and active field that has drawn increasing attention among researchers. With the growth of dynamic video data, the performance of conventional object-detection methods on salient object detection has been degrading. The challenges include blurry moving targets, rapid object movement, and background occlusion or dynamic background changes behind foreground regions in video frames, all of which result in poor saliency detection. In this paper, we design a deep learning model to address these issues, using a novel framework that combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN) for video saliency detection. The proposed method aims at developing a spatiotemporal model that exploits temporal, spatial, and local constraint cues to achieve global optimization. The task of finding the salient objects in benchmark dynamic video datasets is then carried out by capturing the temporal, spatial, and local constraint features with the Convolutional Recurrent Neural Network (CRNN). The CRNN is evaluated on benchmark datasets against conventional video salient object detection methods in terms of precision, F-measure, mean absolute error (MAE), and computational load. The experiments reveal that the CRNN model achieves better performance than other state-of-the-art saliency models in terms of increased speed and reduced computational load.
RGB-D salient object detection (SOD) recently has attracted increasing research interest and many deep learning methods based on encoder-decoder architectures have emerged. However, most existing RGB-D SOD models conduct feature fusion either in the single encoder or the decoder stage, which hardly guarantees sufficient cross-modal fusion ability. In this paper, we make the first attempt in addressing RGB-D SOD through 3D convolutional neural networks. The proposed model, named RD3D, aims at pre-fusion in the encoder stage and in-depth fusion in the decoder stage to effectively promote the full integration of RGB and depth streams. Specifically, RD3D first conducts pre-fusion across RGB and depth modalities through an inflated 3D encoder, and later provides in-depth feature fusion by designing a 3D decoder equipped with rich back-projection paths (RBPP) for leveraging the extensive aggregation ability of 3D convolutions. With such a progressive fusion strategy involving both the encoder and decoder, effective and thorough interaction between the two modalities can be exploited and boost the detection accuracy. Extensive experiments on six widely used benchmark datasets demonstrate that RD3D performs favorably against 14 state-of-the-art RGB-D SOD approaches in terms of four key evaluation metrics. Our code will be made publicly available: https://github.com/PPOLYpubki/RD3D.
Depth information has been widely used to improve RGB-D salient object detection by extracting attention maps to determine the position information of objects in an image. However, non-salient objects may be close to the depth sensor and present high pixel intensities in the depth maps. This situation inevitably leads to erroneous emphasis on non-salient areas and may have a negative impact on the saliency results. To mitigate this problem, we propose a hybrid attention neural network that fuses middle- and high-level RGB features with depth features to generate a hybrid attention map that removes background information. The proposed network extracts multilevel features from RGB images using the Res2Net architecture and then integrates high-level features from depth maps using the Inception-v4-ResNet2 architecture. The mixed high-level RGB and depth features generate the hybrid attention map, which is then multiplied with the low-level RGB features. After decoding by several convolutions and upsampling, we obtain the final saliency prediction, achieving state-of-the-art performance on the NJUD and NLPR datasets. Moreover, the proposed network has good generalization ability compared with other methods. An ablation study demonstrates that the proposed network effectively performs saliency prediction even when non-salient objects interfere with detection. In fact, after removing the branch with high-level RGB features, the RGB attention map that guides the network for saliency prediction is lost, and all the performance measures decline. The resulting prediction map from the ablation study shows the effect of non-salient objects close to the depth sensor; this effect is not present when using the complete hybrid attention network. Therefore, RGB information can correct and supplement depth information, and the corresponding hybrid attention map is more robust than a conventional attention map constructed only from depth information.
Depth data, which carry strong discriminative power about object location, have been proven beneficial for accurate saliency prediction. However, RGB-D saliency detection methods are also negatively influenced by randomly distributed erroneous or missing regions on the depth map or along the object boundaries. This offers the possibility of achieving more effective inference through well-designed models. In this paper, we propose a new framework for accurate RGB-D saliency detection that takes account of local and global complementarities from the two modalities. This is achieved by designing a complementary interaction model discriminative enough to simultaneously select useful representations from RGB and depth data while refining the object boundaries. Moreover, we propose a compensation-aware loss to further process the information not considered by the complementary interaction model, improving the generalization ability for challenging scenes. Experiments on six public datasets show that our method outperforms 18 state-of-the-art methods.
Complex backgrounds and similar appearances between objects and their surroundings are generally recognized as challenging scenarios in Salient Object Detection (SOD). This naturally leads to the incorporation of depth information in addition to the conventional RGB image as input, known as RGB-D SOD or depth-aware SOD. Meanwhile, this emerging line of research has been considerably hindered by the noise and ambiguity that prevail in raw depth images. To address the aforementioned issues, we propose a Depth Calibration and Fusion (DCF) framework that contains two novel components: 1) a learning strategy to calibrate the latent bias in the original depth maps towards boosting the SOD performance; 2) a simple yet effective cross reference module to fuse features from both RGB and depth modalities. Extensive empirical experiments demonstrate that the proposed approach achieves superior performance against 27 state-of-the-art methods. Moreover, our depth calibration strategy alone can work as a preprocessing step; empirically it results in noticeable improvements when being applied to existing cutting-edge RGB-D SOD models. Source code is available at https://github.com/jiwei0921/DCF.
… Besides, we utilize a U-Net [53] structure to construct the modality-specific decoder, where the skip connections between the encoder and decoder layers are used to combine …
… of salient object detection (SOD) methods in challenging scenes. However, existing methods are specially designed for RGB-D … interaction fusion framework for RGB-D and RGB-T SOD, …
RGB-depth (RGB-D) salient object detection (SOD) recently has attracted increasing research interest, and many deep learning methods based on encoder–decoder architectures have emerged. However, most existing RGB-D SOD models conduct explicit and controllable cross-modal feature fusion either in the single encoder or decoder stage, which hardly guarantees sufficient cross-modal fusion ability. To this end, we make the first attempt in addressing RGB-D SOD through 3-D convolutional neural networks. The proposed model, named RD3D, aims at prefusion in the encoder stage and in-depth fusion in the decoder stage to effectively promote the full integration of RGB and depth streams. Specifically, RD3D first conducts prefusion across RGB and depth modalities through a 3-D encoder obtained by inflating 2-D ResNet and later provides in-depth feature fusion by designing a 3-D decoder equipped with rich back-projection paths (RBPPs) for leveraging the extensive aggregation ability of 3-D convolutions. Toward an improved model RD3D+, we propose to disentangle the conventional 3-D convolution into successive spatial and temporal convolutions and, meanwhile, discard unnecessary zero padding. This eventually results in a 2-D convolutional equivalence that facilitates optimization and reduces parameters and computation costs. Thanks to such a progressive-fusion strategy involving both the encoder and the decoder, effective and thorough interactions between the two modalities can be exploited and boost detection accuracy. As an additional boost, we also introduce channel-modality attention and its variant after each path of RBPP to attend to important features. Extensive experiments on seven widely used benchmark datasets demonstrate that RD3D and RD3D+ perform favorably against 14 state-of-the-art RGB-D SOD approaches in terms of five key evaluation metrics. Our code will be made publicly available at https://github.com/PPOLYpubki/RD3D.
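The disentanglement of a 3-D convolution into successive spatial and temporal convolutions mentioned above can be sketched as a (2+1)-D style factorization; the kernel sizes, the dropped zero padding along the modality axis, and the missing normalization layers are simplifying assumptions rather than RD3D+'s exact design.

```python
import torch
import torch.nn as nn

class Factorised3dConv(nn.Module):
    """Replaces a full 3-D convolution with a spatial conv followed by a temporal conv."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # 1 x 3 x 3: purely spatial mixing within each modality "slice"
        self.spatial = nn.Conv3d(in_ch, out_ch, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        # 3 x 1 x 1: mixing across the modality/temporal axis, with no zero padding
        self.temporal = nn.Conv3d(out_ch, out_ch, kernel_size=(3, 1, 1), padding=0)

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, C, D, H, W)
        return self.temporal(torch.relu(self.spatial(x)))

if __name__ == "__main__":
    x = torch.randn(1, 64, 3, 56, 56)                        # D = 3 stacked modality slices
    print(Factorised3dConv(64, 64)(x).shape)                  # torch.Size([1, 64, 1, 56, 56])
```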
Salient object detection is a pixel-level dense prediction task that highlights the prominent object in a scene. Recently, the U-Net framework has been widely used, in which continuous convolution and pooling operations generate multi-level features that are complementary to each other. In view of the greater contribution of high-level features to performance, we propose a triplet transformer embedding module to enhance them by learning long-range dependencies across layers. It is the first to use three transformer encoders with shared weights to enhance multi-level features. By further designing a scale adjustment module to process the input, devising a three-stream decoder to process the output, and attaching depth features to color features for multi-modal fusion, the proposed triplet transformer embedding network (TriTransNet) achieves state-of-the-art performance in RGB-D salient object detection and pushes the performance to a new level. Experimental results demonstrate the effectiveness of the proposed modules and the competitiveness of TriTransNet.
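The weight-sharing idea above, one transformer enhancing several feature levels, can be sketched by applying a single shared transformer encoder to each flattened level; the token dimension, number of heads, and the residual connection are assumptions, not TriTransNet's exact configuration.

```python
import torch
import torch.nn as nn

class SharedTransformerEnhancer(nn.Module):
    """Applies one shared transformer encoder to several flattened feature levels."""
    def __init__(self, channels: int = 256, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)   # shared weights

    def forward(self, feats):                                 # feats: list of (B, C, H, W)
        out = []
        for f in feats:
            b, c, h, w = f.shape
            tokens = f.flatten(2).transpose(1, 2)             # (B, H*W, C)
            enhanced = self.encoder(tokens)                   # same weights for every level
            out.append(enhanced.transpose(1, 2).reshape(b, c, h, w) + f)
        return out

if __name__ == "__main__":
    levels = [torch.randn(1, 256, s, s) for s in (28, 14, 7)]
    print([o.shape for o in SharedTransformerEnhancer()(levels)])
```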
Existing saliency detection models based on RGB colors only leverage appearance cues to detect salient objects. Depth information also plays a very important role in visual saliency detection and can supply complementary cues. Although many RGB-D saliency models have been proposed, they require depth data, which is expensive and not easy to acquire. In this paper, we propose to estimate depth information from monocular RGB images and leverage the intermediate depth features to enhance saliency detection performance in a deep neural network framework. Specifically, we first use an encoder network to extract common features from each RGB image and then build two decoder networks for depth estimation and saliency detection, respectively. The depth decoder features can be fused with the RGB saliency features to enhance their capability. Furthermore, we also propose a novel dense multiscale fusion model to densely fuse multiscale depth and RGB features based on the dense ASPP model. A new global context branch is also added to boost the multiscale features. Experimental results demonstrate that the added depth cues and the proposed fusion model both improve saliency detection performance. Finally, our model not only outperforms state-of-the-art RGB saliency models but also achieves comparable results to state-of-the-art RGB-D saliency models.
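The dense multiscale fusion above builds on atrous spatial pyramid pooling. A plain ASPP block is sketched below for reference; the dilation rates are typical defaults, and the dense cross-branch connections of the actual model are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous spatial pyramid pooling: parallel dilated convs plus a global-context branch."""
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.global_branch = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                           nn.Conv2d(in_ch, out_ch, 1))   # image-level context
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        size = x.shape[-2:]
        feats = [branch(x) for branch in self.branches]
        g = F.interpolate(self.global_branch(x), size=size,
                          mode="bilinear", align_corners=False)
        return self.project(torch.cat(feats + [g], dim=1))

if __name__ == "__main__":
    print(ASPP(512, 256)(torch.randn(1, 512, 32, 32)).shape)   # torch.Size([1, 256, 32, 32])
```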
Salient object detection (SOD) is an important preprocessing operation for various computer vision tasks. Most existing RGB-D SOD models employ additive or concatenation-based strategies to directly aggregate and decode multi-scale features to predict saliency maps. However, due to the large differences between features at different scales, these aggregation strategies may lead to information loss or redundancy, and few methods explicitly consider how to establish connections between features at different scales in the decoding process, which consequently deteriorates detection performance. To this end, we propose a cascaded and aggregated Transformer Network (CATNet) which consists of three key modules, i.e., an attention feature enhancement module (AFEM), a cross-modal fusion module (CMFM), and a cascaded correction decoder (CCD). Specifically, the AFEM is designed on the basis of atrous spatial pyramid pooling to obtain multi-scale semantic information and global context information in high-level features through dilated convolution and a multi-head self-attention mechanism, enhancing high-level features. The role of the CMFM is to enhance and then fuse the RGB and depth features, alleviating the problem of poor-quality depth maps. The CCD is composed of two subdecoders in a cascading fashion. It is designed to suppress noise in low-level features and mitigate the differences between features at different scales. Moreover, the CCD uses a feedback mechanism to correct and repair the output of the subdecoder by exploiting supervised features, so that the information loss caused by the upsampling operation during multi-scale feature aggregation can be mitigated. Extensive experimental results demonstrate that the proposed CATNet achieves superior performance over 14 state-of-the-art RGB-D methods on 7 challenging benchmarks.
… resolution RGB-D saliency dataset, HiBo-UA, containing 1,515 RGB-D image pairs captured in real-life scenarios. To our best knowledge, this is the first high-resolution RGB-D saliency …
This survey reviews the broad application of U-Net and its variants in saliency detection. The research trends extend from basic architectural evolution to multi-modal fusion, high-precision edge preservation, Transformer-based global modeling, real-time lightweight design, and spatiotemporal analysis of video. These research directions complement one another and jointly push saliency detection toward industrial-grade applications with stronger robustness, finer detail recovery, and better real-time performance.