Panoramic Image Enhancement and Panoramic Image Super-Resolution
Foundations of General Image Super-Resolution and Deep Architecture Optimization
This group of publications provides the foundational support for panoramic enhancement techniques, spanning classical convolutional neural networks (CNNs) and residual networks (EDSR/RDN) through to advanced Transformers (SwinIR/TTST) and generative adversarial networks (ESRGAN). Research priorities include learned downscaling, efficiency optimization (the NTIRE challenges), balancing perceptual losses, and robust modeling of real-world noise.
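To make the residual-trunk plus sub-pixel (pixel-shuffle) upsampling pattern shared by several entries below concrete, here is a minimal, hedged PyTorch sketch; the class and function names (`TinySR`, `ResBlock`) are illustrative and not taken from any listed paper.

```python
# Minimal sketch of the EDSR/ESPCN-style pattern: residual blocks without
# BatchNorm, followed by sub-pixel (PixelShuffle) upsampling. Illustrative only.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # local residual connection

class TinySR(nn.Module):
    def __init__(self, scale: int = 4, channels: int = 64, n_blocks: int = 8):
        super().__init__()
        self.head = nn.Conv2d(3, channels, 3, padding=1)
        self.trunk = nn.Sequential(*[ResBlock(channels) for _ in range(n_blocks)])
        # sub-pixel convolution: predict scale**2 * 3 channels, then rearrange
        self.tail = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, lr):
        feat = self.head(lr)
        feat = feat + self.trunk(feat)   # global residual over the trunk
        return self.tail(feat)

if __name__ == "__main__":
    sr = TinySR(scale=4)
    print(sr(torch.randn(1, 3, 64, 64)).shape)  # -> torch.Size([1, 3, 256, 256])
```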
- The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report(Bin Ren, Yawei Li, Nancy Mehta, Radu Timofte, Hongyuan Yu, Cheng Wan, Yuxin Hong, Bingnan Han, Zhuoyuan Wu, Yajun Zou, Yuqing Liu, Jizhe Li, Keji He, Chao Fan, Heng Zhang, Xiaolin Zhang, Xuanwu Yin, Kunlong Zuo, Bohao Liao, Peizhe Xia, Long Peng, Zhibo Du, Xin Di, Wangkai Li, Yang Wang, Wei Zhai, Renjing Pei, Jiaming Guo, Songcen Xu, Yang Cao, Zhengjun Zha, Yan Wang, Yi Liu, Qing Wang, Gang Zhang, Liou Zhang, Shijie Zhao, Long Sun, Jinshan Pan, Jiangxin Dong, Jinhui Tang, Xin Liu, Min Yan, Qian Wang, Menghan Zhou, Yiqiang Yan, Yixuan Liu, Wensong Chan, Dehua Tang, Dong Zhou, Li Wang, Lu Tian, Barsoum Emad, Bohan Jia, Junbo Qiao, Yunshuai Zhou, Yun Zhang, Wei Li, Shaohui Lin, Shenglong Zhou, Binbin Chen, Jincheng Liao, Suiyi Zhao, Zhao Zhang, Bo Wang, Yan Luo, Yanyan Wei, Feng Li, Mingshen Wang, Yawei Li, Jinhan Guan, Dehua Hu, Jiawei Yu, Qisheng Xu, Tao Sun, Long Lan, Kele Xu, Xin Lin, Jingtong Yue, Lehan Yang, Shiyi Du, Lu Qi, Chao Ren, Zeyu Han, Yuhan Wang, Chaolin Chen, Haobo Li, Mingjun Zheng, Zhongbao Yang, Lianhong Song, Xingzhuo Yan, Minghan Fu, Jingyi Zhang, Baiang Li, Qi Zhu, Xiaogang Xu, Dan Guo, Chunle Guo, Jiadi Chen, Huanhuan Long, Chunjiang Duanmu, Xiaoyan Lei, Jie Liu, Weilin Jia, Weifeng Cao, Wenlong Zhang, Yanyu Mao, Ruilong Guo, Nihao Zhang, Qian Wang, Manoj Pandey, Maksym Chernozhukov, Giang Le, Shuli Cheng, Hongyuan Wang, Ziyan Wei, Qingting Tang, Liejun Wang, Yongming Li, Yanhui Guo, Hao Xu, Akram Khatami-Rizi, Ahmad Mahmoudi-Aznaveh, Chih-Chung Hsu, Chia-Ming Lee, Yi-Shiuan Chou, Amogh Joshi, Nikhil Akalwadi, Sampada Malagi, Palani Yashaswini, Chaitra Desai, Ramesh Ashok Tabib, Ujwala Patil, Uma Mudenagudi, 2024, ArXiv Preprint)
- Residual Dense Network for Image Super-Resolution(Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, Y. Fu, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition)
- Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network(C. Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew P. Aitken, Alykhan Tejani, J. Totz, Zehan Wang, Wenzhe Shi, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR))
- SwinIR: Image Restoration Using Swin Transformer(Jingyun Liang, Jie Cao, Guolei Sun, K. Zhang, L. Gool, R. Timofte, 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW))
- ShuffleMixer: An Efficient ConvNet for Image Super-Resolution(Long Sun, Jinshan Pan, Jinhui Tang, 2022, ArXiv Preprint)
- TTST: A Top-k Token Selective Transformer for Remote Sensing Image Super-Resolution(Yi Xiao, Qiangqiang Yuan, Kui Jiang, Jiang He, Chia-Wen Lin, Liangpei Zhang, 2024, IEEE Transactions on Image Processing)
- From Coarse to Fine: Hierarchical Pixel Integration for Lightweight Image Super-Resolution(Jie Liu, Chao Chen, Jie Tang, Gangshan Wu, 2022, ArXiv Preprint)
- Multi-Scale Implicit Transformer with Re-parameterize for Arbitrary-Scale Super-Resolution(Jinchen Zhu, Mingjian Zhang, Ling Zheng, Shizhuang Weng, 2024, ArXiv Preprint)
- NTIRE 2017 Challenge on Single Image Super-Resolution: Dataset and Study(E. Agustsson, R. Timofte, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Detail Loss in Super-Resolution Models Based on the Laplacian Pyramid and Repeated Upscaling and Downscaling Process(Sangjun Han, Youngmi Hur, 2026, ArXiv Preprint)
- Perceptual Losses for Real-Time Style Transfer and Super-Resolution(Justin Johnson, Alexandre Alahi, Li Fei-Fei, 2016, ArXiv)
- Learned Image Downscaling for Upscaling using Content Adaptive Resampler(Wanjie Sun, Zhenzhong Chen, 2019, ArXiv Preprint)
- Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration(Shashank Agnihotri, Julia Grabinski, Janis Keuper, Margret Keuper, 2024, ArXiv Preprint)
- Multi-level Encoder-Decoder Architectures for Image Restoration(Indra Deep Mastan, Shanmuganathan Raman, 2019, ArXiv Preprint)
- OARS: Process-Aware Online Alignment for Generative Real-World Image Super-Resolution(Shijie Zhao, Xuanyu Zhang, Bin Chen, Weiqi Li, Qunliang Xing, Kexin Zhang, Yan Wang, Junlin Li, Li Zhang, Jian Zhang, Tianfan Xue, 2026, ArXiv Preprint)
- Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network(Wenzhe Shi, Jose Caballero, Ferenc Huszár, J. Totz, Andrew P. Aitken, Rob Bishop, D. Rueckert, Zehan Wang, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR))
- Efficient Mixed Transformer for Single Image Super-Resolution(Ling Zheng, Jinchen Zhu, Jinpeng Shi, Shizhuang Weng, 2023, ArXiv Preprint)
- NTIRE 2021 Challenge on Burst Super-Resolution: Methods and Results(Goutam Bhat, Martin Danelljan, Radu Timofte, Kazutoshi Akita, Wooyeong Cho, Haoqiang Fan, Lanpeng Jia, Daeshik Kim, Bruno Lecouat, Youwei Li, Shuaicheng Liu, Ziluan Liu, Ziwei Luo, Takahiro Maeda, Julien Mairal, Christian Micheloni, Xuan Mo, Takeru Oba, Pavel Ostyakov, Jean Ponce, Sanghyeok Son, Jian Sun, Norimichi Ukita, Rao Muhammad Umer, Youliang Yan, Lei Yu, Magauiya Zhussip, Xueyi Zou, 2021, ArXiv Preprint)
- Enhanced Deep Residual Networks for Single Image Super-Resolution(Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, Kyoung Mu Lee, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- DRCT: Saving Image Super-resolution away from Information Bottleneck(Chih-Chung Hsu, Chia-Ming Lee, Yi-Shiuan Chou, 2024, ArXiv Preprint)
- NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-World Video Super-Resolution(Yexing Song, Meilin Wang, Zhijing Yang, Xiaoyu Xian, Yukai Shi, 2023, ArXiv Preprint)
- NTIRE 2021 Challenge on Video Super-Resolution(Sanghyun Son, Suyoung Lee, Seungjun Nah, Radu Timofte, Kyoung Mu Lee, 2021, ArXiv Preprint)
- Image Super-Resolution via Iterative Refinement(Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, Mohammad Norouzi, 2021, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- On the unreasonable vulnerability of transformers for image restoration -- and an easy fix(Shashank Agnihotri, Kanchana Vaishnavi Gandikota, Julia Grabinski, Paramanand Chandramouli, Margret Keuper, 2023, ArXiv Preprint)
- Unsupervised Real Image Super-Resolution via Generative Variational AutoEncoder(Zhi-Song Liu, Wan-Chi Siu, Li-Wen Wang, Chu-Tak Li, Marie-Paule Cani, Yui-Lam Chan, 2020, ArXiv Preprint)
- ESRGAN: Enhanced Super-Resolution Generative Adversarial Networks(Xintao Wang, Ke Yu, Shixiang Wu, Jinjin Gu, Yihao Liu, Chao Dong, Chen Change Loy, Y. Qiao, Xiaoou Tang, 2018, No journal)
- Asymmetric CNN for image super-resolution(Chunwei Tian, Yong Xu, Wangmeng Zuo, Chia-Wen Lin, David Zhang, 2021, ArXiv Preprint)
- Towards Lightweight Super-Resolution with Dual Regression Learning(Yong Guo, Mingkui Tan, Zeshuai Deng, Jingdong Wang, Qi Chen, Jiezhang Cao, Yanwu Xu, Jian Chen, 2022, ArXiv Preprint)
- On The Classification-Distortion-Perception Tradeoff(Dong Liu, Haochen Zhang, Zhiwei Xiong, 2019, ArXiv Preprint)
- Principal Component Analysis Using Structural Similarity Index for Images(Benyamin Ghojogh, Fakhri Karray, Mark Crowley, 2019, ArXiv Preprint)
- Efficient super-resolution and applications to mosaics(A. Zomet, Shmuel Peleg, 2000, Proceedings 15th International Conference on Pattern Recognition. ICPR-2000)
- Cross-selection kernel regression for super-resolution fusion of complementary panoramic images(Lidong Chen, A. Basu, Maojun Zhang, Wei Wang, 2012, 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
Distortion-Aware and Spherical-Representation Algorithms for Panoramic Geometric Distortion
This group forms the core research of the topic, specifically addressing the uneven latitude-longitude sampling and polar distortion introduced by equirectangular projection (ERP). The studies cover latitude-adaptive networks (LAU-Net), deformable convolution, spherical implicit functions, applications of the Mamba architecture to the panoramic domain, and cross-projection (ERP and cubemap) feature fusion techniques.
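As a concrete reference point for the latitude-dependent distortion these works address, the sketch below computes the standard cos(latitude) sampling-density factor of an ERP image, which latitude-adaptive weighting or sampling schemes ultimately compensate for; the function name is illustrative.

```python
# Hedged sketch: cosine-latitude sampling-density factor of equirectangular
# projection (ERP), often used to weight losses or stretch sampling grids.
import numpy as np

def erp_latitude_weights(height: int, width: int) -> np.ndarray:
    """Per-pixel weight proportional to the true spherical area of each ERP pixel.

    Row j of an H x W ERP image covers latitude (j + 0.5 - H/2) * pi / H, so its
    pixels are horizontally over-sampled by 1 / cos(latitude); weighting by
    cos(latitude) compensates (rows near the poles contribute almost nothing).
    """
    j = np.arange(height)
    lat = (j + 0.5 - height / 2.0) * np.pi / height           # in [-pi/2, pi/2)
    w = np.cos(lat)                                           # shape (H,)
    return np.repeat(w[:, None], width, axis=1)               # shape (H, W)

weights = erp_latitude_weights(512, 1024)
print(weights[256, 0], weights[0, 0])  # ~1.0 at the equator, ~0 near the pole
```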
- SphereSR: 360° Image Super-Resolution with Arbitrary Projection via Continuous Spherical Image Representation(Youngho Yoon, Inchul Chung, Lin Wang, Kuk-Jin Yoon, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- 360-Degree Image Super-Resolution Based on Single Image Sample and Progressive Residual Generative Adversarial Network(Liuyihui Qian, Xiaojun Liu, Juan Wu, Xiaoqing Xu, Han Zeng, 2022, 2022 7th International Conference on Image, Vision and Computing (ICIVC))
- Geometric Distortion Guided Transformer for Omnidirectional Image Super-Resolution(Cuixin Yang, Rongkang Dong, Jun Xiao, Cong Zhang, Kin-Man Lam, Fei Zhou, Guoping Qiu, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- MambaOSR: Leveraging Spatial-Frequency Mamba for Distortion-Guided Omnidirectional Image Super-Resolution(Weilei Wen, Qianqian Zhao, Xiuli Shao, 2025, Entropy)
- Omnidirectional image super-resolution via position attention network(Xin Wang, Shiqin Wang, Jinxing Li, Mu Li, Jinkai Li, Yong Xu, 2024, Neural Networks)
- Spherical Pseudo-Cylindrical Representation for Omnidirectional Image Super-resolution(Qing Cai, Mu-Wei Li, Dongwei Ren, Jun Lyu, Haiyong Zheng, Junyu Dong, Yee-Hong Yang, 2024, No journal)
- OmniSSR: Zero-shot Omnidirectional Image Super-Resolution using Stable Diffusion Model(Runyi Li, Xuhan Sheng, Weiqi Li, Jian Zhang, 2024, No journal)
- Lightweight omnidirectional super-resolution via frequency-spatial fusion and equirectangular projection correction(Dezhi Li, Yonglin Chen, Xingbo Dong, T. Ng, Zhe Jin, Shengyuan Wang, Wen Sha, 2025, Journal of Electronic Imaging)
- Super-resolution of Multi-view ERP 360-Degree Images with Two-Stage Disparity Refinement(Hee-Jae Kim, Jewon Kang, Byungchun Lee, 2020, 2020 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC))
- Super-resolution from an omnidirectional image sequence(Hajime Nagahara, Yasushi Yagi, M. Yachida, 2000, 2000 26th Annual Conference of the IEEE Industrial Electronics Society. IECON 2000. 2000 IEEE International Conference on Industrial Electronics, Control and Instrumentation. 21st Century Technologies)
- A Single Frame and Multi-Frame Joint Network for 360-degree Panorama Video Super-Resolution(Hongying Liu, Zhubo Ruan, Chaowei Fang, Peng Zhao, Fanhua Shang, Yuanyuan Liu, Lijun Wang, 2020, ArXiv)
- Spatio-Temporal Distortion Aware Omnidirectional Video Super-Resolution(Hongyu An, Xinfeng Zhang, Li Zhang, Ruiqin Xiong, 2024, ArXiv)
- Omnidirectional Image Super-resolution via Bi-projection Fusion(Jiangang Wang, Yuning Cui, Yawen Li, Wenqi Ren, Xiaochun Cao, 2024, No journal)
- Applying VertexShuffle toward 360-degree video super-resolution(N. Li, Yao Liu, 2022, Proceedings of the 32nd Workshop on Network and Operating Systems Support for Digital Audio and Video)
- Enhanced Equirectangular Projection Images with Nonlinear Stretching(Manisha Mane, Anand Bhaskar, 2024, 2024 International Conference on Intelligent Systems and Advanced Applications (ICISAA))
- Co-projection-plane based 3-D padding for polyhedron projection for 360-degree video(Li Li, Zhu Li, Xiang Ma, Haitao Yang, Houqiang Li, 2017, 2017 IEEE International Conference on Multimedia and Expo (ICME))
- VertexShuffle-Based Spherical Super-Resolution for 360-Degree Videos(Na Li, Yao Liu, 2024, ACM Transactions on Multimedia Computing, Communications and Applications)
- EDM: Equirectangular Projection-Oriented Dense Kernelized Feature Matching(Dongki Jung, Jaehoon Choi, Yonghan Lee, Somi Jeong, Taejae Lee, Dinesh Manocha, Suyong Yeon, 2025, ArXiv Preprint)
- LAU-Net: Latitude Adaptive Upscaling Network for Omnidirectional Image Super-resolution(Xin Deng, Hao Wang, Mai Xu, Yichen Guo, Yuhang Song, Li Yang, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- OSRT: Omnidirectional Image Super-Resolution with Distortion-aware Transformer(Fang Yu, Xintao Wang, Ming Cao, Gengyan Li, Y. Shan, Chao Dong, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- DiffOSR: Latitude-aware conditional diffusion probabilistic model for omnidirectional image super-resolution(Leiming Liu, Ting Luo, Gangyi Jiang, Yeyao Chen, Haiyong Xu, Renzhi Hu, Zhouyan He, 2025, Knowl. Based Syst.)
- 360 Panorama Super-resolution using Deep Convolutional Networks(Vida Fakour Sevom, E. Guldogan, J. Kämäräinen, 2018, No journal)
- Latitude-oriented hierarchical enhancement network for omnidirectional image super-resolution(Xin Wang, Jinkai Li, Jinxing Li, Shiqin Wang, Yong Xu, 2025, Inf. Process. Manag.)
- Fast Omni-Directional Image Super-Resolution: Adapting the Implicit Image Function with Pixel and Semantic-Wise Spherical Geometric Priors(Xuelin Shen, Yitong Wang, Silin Zheng, Kang Xiao, Wenhan Yang, Xu Wang, 2025, ArXiv)
- OverallNet: Scale-Arbitrary Lightweight SR Model for handling 360° Panoramic Images(Dongsik Yoon, Jongeun Kim, Seonggeun Song, Yejin Lee, Gunhee Lee, 2024, SIGGRAPH Asia 2024 Posters)
- Dual Enhancement in ODI Super-Resolution: Adapting Convolution and Upsampling to Projection Distortion(Xiang Ji, Changqiao Xu, Lujie Zhong, Shujie Yang, Han Xiao, Gabriel-Miro Muntean, 2024, No journal)
- FATO: Frequency Attention Transformer for Omnidirectional Image Super-Resolution(Hongyu An, Xinfeng Zhang, Shijie Zhao, Li Zhang, 2024, Proceedings of the 6th ACM International Conference on Multimedia in Asia)
- OSRLLDM: Omnidirectional image super-resolution with latitude-aware latent diffusion models(Leiming Liu, Ting Luo, Yeyao Chen, Gangyi Jiang, Haiyong Xu, Renzhi Hu, Zhouyan He, 2025, Inf. Fusion)
- OPDN: Omnidirectional Position-aware Deformable Network for Omnidirectional Image Super-Resolution(Xiaopeng Sun, Weiqi Li, Zhenyu Zhang, Qiufang Ma, Xuhan Sheng, Ming-Hui Cheng, Haoyu Ma, Shijie Zhao, Jian Zhang, Junlin Li, Li Zhang, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- NTIRE 2023 Challenge on 360° Omnidirectional Image and Video Super-Resolution: Datasets, Methods and Results(Ming Cao, Chong Mou, Fang Yu, Xintao Wang, Yinqiang Zheng, Jian Zhang, Chao Dong, Gen Li, Ying Shan, R. Timofte, Xiaopeng Sun, Weiqi Li, Zhenyu Zhang, Xuhan Sheng, Bin Chen, Haoyu Ma, Ming-Hui Cheng, Shijie Zhao, Wanwan Cui, Tianyu Xu, Chunyang Li, Long Bao, Heng Sun, Huaibo Huang, Xiaoqiang Zhou, Yuang Ai, Ran He, Ren-Rong Wu, Yi Yang, Zhilu Zhang, Shuohao Zhang, Junyi Li, Yunjin Chen, Dongwei Ren, W. Zuo, Qian Wang, Hao-Hsiang Yang, Yi-Chung Chen, Zhi-Kai Huang, Wei-Ting Chen, Yuan Chiang, Hua-En Chang, I-Hsiang Chen, Chia-Hsuan Hsieh, Sy-Yen Kuo, Zebin Zhang, Jiaqi Zhang, Yuhui Wang, Shuhao Cui, Jun Steed Huang, Li Zhu, Shuman Tian, Wei-xing Yu, Bingchun Luo, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Omnidirectional Image Super-Resolution via Latitude Adaptive Network(Xin Deng, H. Wang, Mai Xu, Li Li, Zulin Wang, 2023, IEEE Transactions on Multimedia)
- Multi-level distortion-aware deformable network for omnidirectional image super-resolution(Cuixin Yang, Rongkang Dong, Kin-Man Lam, Yuhang Zhang, Guoping Qiu, 2025, ArXiv)
- PanoExtend: An omnidirectional image super-resolution method based on spherical expansion(Xingtao Wang, Kaixin Wu, Jinyu Zhang, Yuxuan Wang, Wenrui Li, 2025, Proceedings of the 7th ACM International Conference on Multimedia in Asia)
- Geometric relationship-guided transformer network for omnidirectional image super-resolution(Junfeng Cao, Qinghai Ding, Haibo Luo, 2025, Signal, Image and Video Processing)
- Learning Local Implicit Fourier Representation for Image Warping(Jae-Won Lee, K. Choi, Kyong Hwan Jin, 2022, No journal)
Application-Level System Optimization for Streaming Delivery and Edge Devices
This group focuses on industrial deployment, exploring how to improve the 360-degree video experience under constrained bandwidth and compute. Core techniques include viewport (FoV)-based adaptive super-resolution, energy-efficiency modeling on mobile devices (EOS), edge-computing-assisted enhancement, and neural video compression. Through cloud-edge-device collaboration, these systems achieve low-latency, high-quality panoramic video streaming.
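A hedged sketch of the viewport-aware tile selection step that most of the streaming systems below build on: tiles whose centers fall inside the predicted field of view are earmarked for high-quality or super-resolved delivery. This is a crude circular-FoV approximation for illustration only, not any particular system's scheduler, and all names are illustrative.

```python
# Hedged sketch: mark ERP tiles whose centers lie within ~FoV/2 angular distance
# of the predicted gaze direction (circular-FoV simplification).
import numpy as np

def sphere_dir(lon: float, lat: float) -> np.ndarray:
    """Unit vector for a (longitude, latitude) pair in radians."""
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def tiles_in_viewport(rows: int, cols: int, gaze_lon: float, gaze_lat: float,
                      fov_deg: float = 100.0) -> np.ndarray:
    """Boolean (rows, cols) mask over an ERP tile grid."""
    gaze = sphere_dir(gaze_lon, gaze_lat)
    half_fov = np.deg2rad(fov_deg) / 2.0
    mask = np.zeros((rows, cols), dtype=bool)
    for r in range(rows):
        lat = (r + 0.5) / rows * np.pi - np.pi / 2.0
        for c in range(cols):
            lon = (c + 0.5) / cols * 2.0 * np.pi - np.pi
            ang = np.arccos(np.clip(np.dot(gaze, sphere_dir(lon, lat)), -1.0, 1.0))
            mask[r, c] = ang <= half_fov
    return mask

# e.g. a 6x12 tile grid with the user looking straight ahead at the equator:
print(tiles_in_viewport(6, 12, gaze_lon=0.0, gaze_lat=0.0).astype(int))
```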
- L3BOU: Low Latency, Low Bandwidth, Optimized Super-Resolution Backhaul for 360-Degree Video Streaming(Ayush Sarkar, John Murray, Mallesham Dasari, M. Zink, K. Nahrstedt, 2021, 2021 IEEE International Symposium on Multimedia (ISM))
- EOS: Energy-Optimized Super-Resolution on Mobile Devices for Live 360-Degree Videos(Seonghoon Park, Minchan Kim, Hyejin Park, Jeho Lee, Jiwon Kim, Hojung Cha, 2025, Proceedings of the 31st Annual International Conference on Mobile Computing and Networking)
- Energy-Efficient Multi-User Adaptive 360° Video Streaming: A Two-Step Approach With Device Video Super-Resolution(Yannan Wei, Qiang Ye, Weihua Zhuang, Xuemin Shen, 2026, IEEE Transactions on Network Science and Engineering)
- Adaptive Super-Resolution Semantic Communication for Mobile AI-Generated Panoramic Video(Haixiao Gao, Mengying Sun, Yuantao Zhang, Haiming Wang, Xiaodong Xu, 2025, IEEE INFOCOM 2025 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS))
- FAESR: Fine-Grained Rate Adaptation for Energy-Aware Super Resolution in Mobile Panoramic Video Streaming(Tao Zhang, Yuxing Wei, Dong Jin, Shuangwu Chen, Yongyi Ran, Xiaobin Tan, Jian Yang, 2026, IEEE Transactions on Cognitive Communications and Networking)
- HVASR: Enhancing 360-degree video delivery with viewport-aware super resolution(Pingping Dong, Shangyu Li, X. Gong, Lianming Zhang, 2024, Inf. Sci.)
- OmniLive: Super-Resolution Enhanced 360° Video Live Streaming for Mobile Devices(Seonghoon Park, Yeonwoo Cho, Hyungchol Jun, Jeho Lee, Hojung Cha, 2023, Proceedings of the 21st Annual International Conference on Mobile Systems, Applications and Services)
- Streaming 360-Degree Videos Using Super-Resolution(Mallesham Dasari, A. Bhattacharya, Santiago Vargas, Pranjal Sahu, A. Balasubramanian, Samir R Das, 2020, IEEE INFOCOM 2020 - IEEE Conference on Computer Communications)
- VRFormer: 360-Degree Video Streaming with FoV Combined Prediction and Super resolution(Zhihao Zhang, Haipeng Du, Shouqin Huang, Weizhan Zhang, Qinghua Zheng, 2022, 2022 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom))
- SRA360: Super-Resolution Enhanced Adaptive 360-Degree Video Streaming for Heterogeneous Viewers(Zhonghui Wu, Lu Lu, Xingyan Chen, Yunxiao Ma, Shujie Yang, Mu Wang, Lujie Zhong, D. Wu, Changqiao Xu, 2025, IEEE Transactions on Network Science and Engineering)
- Live VR Panoramic Video Streaming Method Combing Viewport Prediction and Super-Resolution(Xiaolei Chen, Baoning Cao, Yubing Lu, Pengcheng Zhang, 2024, Journal of Computer-Aided Design & Computer Graphics)
- SR360: boosting 360-degree video streaming with super-resolution(Jiawen Chen, Miao Hu, Zhenxiao Luo, Zelong Wang, Di Wu, 2020, Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video)
- Central Vision based Super-resolution for 360-Degree Videos(Xiaoyan Wang, Tho Duc Nguyen, Chanh Minh Tran, Eiji Kamioka, Tan Xuan Phan, 2023, Proceedings of the 2023 7th International Conference on Big Data and Internet of Things)
- Adaptive 360-Degree Video Streaming with Super-Resolution and Interpolation(Siyuan Hong, Ruiqi Wang, Guohong Cao, 2025, 2025 IEEE Conference Virtual Reality and 3D User Interfaces (VR))
- Scalable Coding of 360-degree Video for Streaming Adaptation at 5G Network Edges(J. Carreira, S. Faria, Luís M. N. Tavora, A. Navarro, P. Assunção, 2020, 2020 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB))
- mmWave Networking and Edge Computing for Scalable 360° Video Multi-User Virtual Reality(Sabyasachi Gupta, Jacob Chakareski, P. Popovski, 2022, IEEE Transactions on Image Processing)
- Adaptive Cross-Modal Super-Resolution Semantic Communication for Mobile AI-Generated Panoramic Video(Haixiao Gao, Mengying Sun, Xiaodong Xu, Xiqi Cheng, Shujun Han, Ping Zhang, 2026, IEEE Transactions on Cognitive Communications and Networking)
- MAESR360: Masked autoencoder-based 360-degree video streaming via multi-scale feature fusion(Li Yu, Zhi Pang, M. Gabbouj, 2024, 2024 IEEE International Conference on Visual Communications and Image Processing (VCIP))
- Edge assisted frame interpolation and super resolution for efficient 360-degree video delivery(Chamara Madarasingha, Kanchana Thilakarathna, 2022, Proceedings of the 28th Annual International Conference on Mobile Computing And Networking)
- Neural Compression of 360-Degree Equirectangular Videos using Quality Parameter Adaptation(Daichi Arai, Yuichi Kondo, Kyohei Unno, Yasuko Sugito, Yuichi Kusakabe, 2025, ArXiv Preprint)
- RA360SR: A Real-time Acceleration-adaptive 360-degree Video Super-resolution System(Jiapeng Chi, D. Reiners, C. Cruz-Neira, 2022, 2022 IEEE International Symposium on Mixed and Augmented Reality Adjunct (ISMAR-Adjunct))
- Nexus: On-Device Lightweight Real-Time Super-Resolution System for 360° Videos(Yuchen Li, Xinyi Li, Tingjuan Lu, Jin Zhou, Shuang Li, Xiaoli Gong, Jin Zhang, 2026, Tsinghua Science and Technology)
Panoramic Content Generation, Restoration, and 3D Scene Reconstruction
These works use diffusion models and generative pretrained models for panorama outpainting, text-to-panorama generation, and high-quality content completion. They also cover 3D Gaussian Splatting (3DGS) for the metaverse and depth estimation from panoramic images, with the goal of building immersive virtual environments.
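For the depth-estimation and 3DGS entries in this group, the shared geometric primitive is back-projecting ERP pixels along spherical rays. A hedged NumPy sketch follows; axis conventions vary between papers, this is one common choice, and the function name is illustrative.

```python
# Hedged sketch: ERP pixels + monocular depth -> 3D point cloud in the camera frame.
import numpy as np

def erp_depth_to_points(depth: np.ndarray) -> np.ndarray:
    """depth: (H, W) per-pixel distances. Returns (H*W, 3) points."""
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    lon = (u + 0.5) / w * 2.0 * np.pi - np.pi           # [-pi, pi)
    lat = np.pi / 2.0 - (v + 0.5) / h * np.pi           # +pi/2 at the top row
    dirs = np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)             # unit rays, (H, W, 3)
    return (dirs * depth[..., None]).reshape(-1, 3)

points = erp_depth_to_points(np.ones((256, 512)))       # unit sphere for depth == 1
print(points.shape, np.allclose(np.linalg.norm(points, axis=1), 1.0))
```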
- Pipeline for Text-to-Image Panoramic Road Scene Generation and Evaluation(Ryan Yan Hern Sim, Pai Chet Ng, Kan Chen, J. S. Lee, 2024, Adjunct Proceedings of the 16th International Conference on Automotive User Interfaces and Interactive Vehicular Applications)
- Diffusion360: Seamless 360 Degree Panoramic Image Generation based on Diffusion Models(Mengyang Feng, Jinlin Liu, Miaomiao Cui, Xuansong Xie, 2023, ArXiv)
- Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving(Yuqing Wen, Yucheng Zhao, Yingfei Liu, Binyuan Huang, Fan Jia, Yanhui Wang, Chi Zhang, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Virtual Staging Technologies for the Metaverse(Muhammad Tukur, Yehia Boraey, Sara Jashari, A. Villanueva, Uzair Shah, Mahmood Al-Zubaidi, G. Pintore, Enrico Gobbetti, J. Schneider, Marco Agus, Noora Fetais, 2024, 2024 2nd International Conference on Intelligent Metaverse Technologies & Applications (iMETA))
- Cylin-Painting: Seamless 360° Panoramic Image Outpainting and Beyond(K. Liao, Xiangyu Xu, Chunyu Lin, Wenqi Ren, Yunchao Wei, Yao Zhao, 2022, IEEE Transactions on Image Processing)
- L-MAGIC: Language Model Assisted Generation of Images with Coherence(Zhipeng Cai, Matthias Mueller, R. Birkl, Diana Wofk, Shao-Yen Tseng, Junda Cheng, Gabriela Ben-Melech Stan, Vasudev Lal, Michael Paulitsch, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting(Penglei Gao, Kai Yao, Tiandi Ye, Steven Wang, Yuan Yao, Xiaofeng Wang, 2024, ArXiv)
- Target Scanpath-Guided 360-Degree Image Enhancement(Yujia Wang, Fang-Lue Zhang, N. Dodgson, 2025, No journal)
- Exploiting Diffusion Prior for Real-World Image Super-Resolution(Jianyi Wang, Zongsheng Yue, Shangchen Zhou, Kelvin C. K. Chan, Chen Change Loy, 2023, International Journal of Computer Vision)
- ResShift: Efficient Diffusion Model for Image Super-resolution by Residual Shifting(Zongsheng Yue, Jianyi Wang, Chen Change Loy, 2023, ArXiv)
- Splatter-360: Generalizable 360° Gaussian Splatting for Wide-baseline Panoramic Images(Zheng Chen, Chenming Wu, Zhelun Shen, Chen Zhao, Weicai Ye, Haocheng Feng, Errui Ding, Song-Hai Zhang, 2024, ArXiv Preprint)
- Estimating Depth of Monocular Panoramic Image with Teacher-Student Model Fusing Equirectangular and Spherical Representations(Jingguo Liu, Yijun Xu, Shigang Li, Jianfeng Li, 2024, ArXiv Preprint)
- Distortion-aware Depth Estimation with Gradient Priors from Panoramas of Indoor Scenes(Ruihong Yin, Sezer Karaoglu, T. Gevers, 2022, 2022 International Conference on 3D Vision (3DV))
- SPCNet: A Panoramic image depth estimation method based on spherical convolution(Sijin He, Yu Liu, Yumei Wang, 2021, 2021 International Conference on Visual Communications and Image Processing (VCIP))
- EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation(Ilwi Yun, Chanyong Shin, Hyunku Lee, Hyuk-Jae Lee, Chae Eun Rhee, 2023, ArXiv Preprint)
- BiFuse: Monocular 360 Depth Estimation via Bi-Projection Fusion(Fu-En Wang, Yu-Hsuan Yeh, Min Sun, Wei-Chen Chiu, Yi-Hsuan Tsai, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
Panoramic Visual Quality Assessment, Dataset Construction, and Projection Conventions
This group studies how to assess the quality of panoramas that contain geometric distortion, including blind omnidirectional image quality assessment (BOIQA) metrics for 8K panoramas and frequency-aware evaluation. It also discusses how different projection models (ERP, cubemap, etc.) affect compression efficiency and super-resolution results, and provides high-quality datasets dedicated to this field.
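A hedged sketch of WS-PSNR (weighted-to-spherically-uniform PSNR), the commonly used ERP fidelity metric in this area: squared errors are weighted by cos(latitude) so over-sampled polar rows do not dominate the score. Names are illustrative, not a reference implementation.

```python
# Hedged sketch of WS-PSNR for ERP images: WMSE = sum(w * e^2) / sum(w),
# with per-row weights w = cos(latitude).
import numpy as np

def ws_psnr(ref: np.ndarray, dist: np.ndarray, max_val: float = 255.0) -> float:
    """ref, dist: (H, W) or (H, W, C) ERP images with identical shapes."""
    h = ref.shape[0]
    lat = (np.arange(h) + 0.5 - h / 2.0) * np.pi / h
    w = np.cos(lat)[:, None]                 # per-row weight, broadcast over width
    if ref.ndim == 3:
        w = w[..., None]                     # broadcast over channels as well
    se = (ref.astype(np.float64) - dist.astype(np.float64)) ** 2
    wmse = np.sum(w * se) / np.sum(np.ones_like(se) * w)
    return float(10.0 * np.log10(max_val ** 2 / wmse))

rng = np.random.default_rng(0)
gt = rng.integers(0, 256, (256, 512, 3))
print(ws_psnr(gt, np.clip(gt + rng.normal(0, 5, gt.shape), 0, 255)))
```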
- Segmented Spherical Projection-Based Blind Omnidirectional Image Quality Assessment(Xuelei Zheng, G. Jiang, Mei Yu, Hao Jiang, 2020, IEEE Access)
- Local Visual and Global Deep Features Based Blind Stitched Panoramic Image Quality Evaluation Using Ensemble Learning(Yueli Cui, G. Jiang, Mei Yu, Yang Song, 2022, IEEE Transactions on Emerging Topics in Computational Intelligence)
- A no-reference panoramic image quality assessment with hierarchical perception and color features(Yun Liu, X. Yin, Chang Tang, Guanghui Yue, Yan Wang, 2023, J. Vis. Commun. Image Represent.)
- Frequency-Aware Native Resolution Assessment of 8K Omnidirectional Images(Jingwen Hou, Zengliang Li, Jiebin Yan, Weide Liu, Yuming Fang, Wei Zhou, 2025, 2025 International Conference on Visual Communications and Image Processing (VCIP))
- Quality Assessment of Super-Resolved Omnidirectional Image Quality Using Tangential Views(C. Ozcinar, A. Rana, 2021, ArXiv)
- Scale Guided Hypernetwork for Blind Super-Resolution Image Quality Assessment(Jun Fu, 2023, ArXiv Preprint)
- ODVISTA: An Omnidirectional Video Dataset for Super-Resolution and Quality Enhancement Tasks(Ahmed Telili, Ibrahim Farhat, W. Hamidouche, Hadi Amirpour, 2024, No journal)
- Saliency-driven rate-distortion optimization for 360-degree image coding(J. Chiang, Cheng-Yu Yang, Bhishma Dedhia, Yi-Fan Char, 2020, Multimedia Tools and Applications)
- A Comprehensive Comparison of Projections in Omnidirectional Super-Resolution(Huicheng Pi, Senmao Tian, Ming Lu, Jiaming Liu, Yandong Guo, Shunli Zhang, 2023, ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Content-Aware Cubemap Projection for Panoramic Image via Deep Q-Learning(Zihao Chen, Xu Wang, Yu Zhou, Longhao Zou, Jianmin Jiang, 2019, No journal)
- Activation Map-based Vector Quantization for 360-degree Image Semantic Communication(Yan-yan Ma, Wenchi Cheng, Jingqing Wang, Wei Zhang, 2024, GLOBECOM 2024 - 2024 IEEE Global Communications Conference)
Image Stitching, Multi-Sensor Fusion, and Vertical-Industry Applications
This group focuses on the complete panoramic imaging pipeline and specialized deployment scenarios. The studies include multi-camera video stitching, underwater panorama enhancement, lunar-exploration image restoration, industrial pipeline monitoring, and virtual-reality enhancement for physical-education teaching, examining how super-resolution can be combined with these systems to keep panoramic vision usable in complex environments.
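A hedged sketch of the classic feature-based stitching step that several application papers in this group pair with super-resolution (ORB keypoints, RANSAC homography, perspective warp), using standard OpenCV calls; the file paths are placeholders, and a real mosaic would add seam blending and run SR on the inputs or the stitched result.

```python
# Hedged sketch: align two overlapping views with ORB + RANSAC homography,
# then warp one onto a wide canvas (no blending). Illustrative names and paths.
import cv2
import numpy as np

def stitch_pair(img_a: np.ndarray, img_b: np.ndarray) -> np.ndarray:
    orb = cv2.ORB_create(4000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    matches = sorted(matches, key=lambda m: m.distance)[:200]
    src = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)      # maps img_b -> img_a
    h, w = img_a.shape[:2]
    canvas = cv2.warpPerspective(img_b, H, (w * 2, h))        # warp onto a wide canvas
    canvas[:, :w] = img_a                                      # naive overwrite, no blending
    return canvas

# Placeholder file names; any two overlapping views will do.
pano = stitch_pair(cv2.imread("left.jpg"), cv2.imread("right.jpg"))
cv2.imwrite("pano.jpg", pano)
```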
- Feature Matching-Based Undersea Panoramic Image Stitching in VR Animation(Yawen Tang, Jianhong Ren, 2023, Int. J. Image Graph.)
- Minimalist and High-Quality Panoramic Imaging With PSF-Aware Transformers(Qi Jiang, Shaohua Gao, Yao Gao, Kailun Yang, Zhonghua Yi, Haowen Shi, Lei Sun, Kaiwei Wang, 2023, IEEE Transactions on Image Processing)
- Improving Image Stitching Effect using Super-Resolution Technique(Jinjun Liu, 2024, International Journal of Advanced Computer Science and Applications)
- Channel-Spatial Attention Network for Lunar Image Super-Resolution(Yabo Duan, Huaizhan Li, Kefei Zhang, Shubi Zhang, Suqin Wu, 2022, Proceedings of the 2022 5th International Conference on Image and Graphics Processing)
- Applying super-resolution to panoramic mosaics(A. Zomet, Shmuel Peleg, 1998, Proceedings Fourth IEEE Workshop on Applications of Computer Vision. WACV'98 (Cat. No.98EX201))
- Super-resolution fusion of complementary panoramic images based on cross-selection kernel regression interpolation.(Lidong Chen, A. Basu, Maojun Zhang, Wei Wang, Yu Liu, 2014, Applied optics)
- Adaptive super-resolution framework for catadioptric omnidirectional images via cylindrical projection and latitude-aware networks(Jiaxu Zhang, Yaowen Lv, Yuxuan Wang, Xu Zhu, Yiming Hu, 2025, The Visual Computer)
- Research on Physical Education Teaching Mode in Colleges and Universities Based on VR Technology(Dezhi Kong, Aoyao Zhang, 2023, Applied Mathematics and Nonlinear Sciences)
- Research on super-resolution reconstruction for panoramic annular images(Yan Lou, Hui Li, Xinyi Qin, Zhipeng Ren, Shengya Zhao, Yihao Hou, Lun Jiang, 2024, No journal)
- Research on Fast Recognition Method for Robot Soccer Visual Images(Yaobo Long, Leiyu Pan, 2024, 2024 International Conference on Computers, Information Processing and Advanced Education (CIPAE))
- Unified auxiliary restoration network for robust multimodal 3D object detection in adverse conditions(Jae Hyun Yoon, Jong Won Jung, Seok Bong Yoo, 2025, Neural Networks)
The field of panoramic image enhancement and super-resolution has formed a multi-level research system, ranging from low-level general-purpose super-resolution, through mid-level geometry-aware algorithms, to high-level streaming system optimization. The current technical frontier shows the following characteristics: first, a shift from pure pixel upscaling toward deep modeling of geometric properties (e.g., latitude awareness and spherical implicit representations); second, deep integration with generative AI (diffusion models, Mamba) for content completion and extension; third, system-level real-time performance and energy-efficiency optimization becoming key to the commercial deployment of VR/AR; and fourth, rapid penetration of panoramic technology into vertical domains such as underwater, aerial, and autonomous-driving applications.
A total of 166 related publications. Abstracts of selected entries follow below.
The existing VR panoramic video streaming methods based on viewport prediction have not effectively considered the temporal dimension information of VR video, and have ignored the role that the client end should play in improving the video reconstruction quality. To further improve the comprehensive performance of the streaming system and the user quality of experience, a live VR panoramic video streaming method that combines viewport prediction and super-resolution reconstruction is proposed. At the server end, the method captures the long-distance dependency in the temporal dimension of video content features by embedding the proposed temporal non-local attention module TNAM into GhostNet to model the global context of VR panoramic video. At the client end, the proposed lightweight VR panoramic video super-resolution reconstruction model LVRSR is used to enhance the quality and mitigate the projection distortion of the content within the viewport predicted at the server end. The experimental results on a VR panoramic video user head motion dataset show that the average viewport prediction accuracy and average bandwidth usage of the method are 95.6% and 52.9%, respectively. Compared with five representative streaming methods, the method achieves higher viewport prediction accuracy and lower bandwidth usage, while offering good video reconstruction quality and low computational resource consumption.
A panoramic annular image is an optical image that projects surrounding objects onto a circular area with the lens center as the focal point. It is widely used in petroleum pipeline monitoring systems, but it is difficult to obtain high-quality images due to complex natural conditions in the surroundings, which affects the accuracy of the pipeline monitoring system. To address this issue, we designed an optical imaging system that can produce self-collected datasets. A dual-path network super-resolution algorithm is proposed, in which an enhanced deep recursion network extracts detailed feature information from images in one path, and an attention mechanism network extracts important feature information in the other path. The information from the two paths is then fused to reconstruct the high-resolution image. Experimental analysis and comparison were conducted on the self-collected datasets using the algorithm, and the results demonstrate that our approach significantly improves image details and contributes to enhancing the accuracy of the pipeline monitoring system.
Adaptive Cross-Modal Super-Resolution Semantic Communication for Mobile AI-Generated Panoramic Video
The development of artificial intelligence-generated panoramic video (AIGPV) provides a potential solution for the diverse demands of users in immersive communication scenarios of 6G. However, ensuring users’ quality of experience (QoE) under resource-constrained conditions remains a significant challenge. In this paper, we introduce a cross-modal semantic communication transmission scheme for mobile-AIGPV. This scheme generates low-resolution panoramic videos on the edge server and transmits them via cross-modal semantic communication. Furthermore, to better ensure users’ QoE, we propose a panoramic video super-resolution transmission framework based on deep joint source-channel coding (JSCC), named PVSR-JSCC. This framework leverages neural networks (NNs) and attention mechanisms to achieve semantic extraction and adaptive variable-length coding, and performs feature cyclic enhancement based on the motion and context, thereby achieving the panoramic video super-resolution task. Additionally, we design a cross-modal swin transformer block (CMSTB) to enhance the deep JSCC performance. The CMSTB integrates self-attention and cross-attention mechanisms, enabling deep JSCC to capture the intrinsic semantic correlation between text and video. Simulation results demonstrate that, compared to the semantic communication scheme directly transmitting high-resolution videos, our proposed PVSR-JSCC not only reduces bandwidth consumption by 36.7% but also lowers computational complexity of the semantic model by 42.8%. Moreover, it effectively prevents the “cliff effect” caused by the decrease of signal-to-noise ratio (SNR) in traditional communication systems.
Panoramic videos are shot by an omnidirectional camera or a collection of cameras, and can display a view in every direction, providing viewers with an immersive experience. The study of super-resolution of panoramic videos has attracted much attention, and many methods have been proposed, especially deep learning-based methods. However, because of their complex architectures, these methods always involve a large number of hyperparameters. To address this issue, we propose the first lightweight super-resolution method with self-calibrated convolution for panoramic videos. A new deformable convolution module is designed first, built on self-calibrated convolution, which can learn more accurate offsets and enhance feature alignment. Moreover, we present a new residual dense block for feature reconstruction, which can significantly reduce the parameters while maintaining performance. The performance of the proposed method is compared to that of the state-of-the-art methods and is verified on the MiG panoramic video dataset.
Super-resolution (SR) has been proposed to reduce the bandwidth overhead and improve the user's quality of experience (QoE) for panoramic video. However, video reconstruction greatly increases the energy consumption on mobile devices with limited battery capacity, which is rarely considered in existing works. In this work, we propose FAESR, a Fine-grained bitrate Adaptation method with Energy-aware Super-Resolution, to maximize the QoE and minimize the energy consumption. We propose an SR power model, which is the first model to evaluate the power consumption of SR on mobile devices through actual measurements. We formulate a joint optimization problem for QoE- and energy-aware panoramic video streaming. Most neural-enhanced panoramic streaming methods use coarse-grained adaptation, either selecting only the download bitrate or assigning a uniform bitrate within the predicted field of view (FoV). This can lead to bandwidth waste due to overrated tiles incorrectly predicted to be within the FoV. We develop a fine-grained bitrate adaptation algorithm based on branching sequential DRL, which jointly optimizes download and reconstruction bitrates at the tile level. Evaluation results demonstrate that FAESR can significantly reduce energy consumption by 28% while improving the QoE by 12% compared to existing state-of-the-art works.
The high resolution of omnidirectional images (ODIs) results in substantial acquisition costs, a challenge that can be alleviated via super-resolution techniques. Mainstream approaches to 2D planar image super-resolution convert images into 1D sequences via row-wise concatenation, achieving impressive performance. However, directly applying such a dimension-reduction pipeline to ODIs suffers from projection distortion. To address this issue, this paper proposes PanoExpand, a method that facilitates super-resolution by bidirectionally converting between ODIs and 1D sequences. Based on spherical projection, PanoExpand transforms ODIs into 1D sequences from a 3D perspective, effectively eliminating inherent distortions while preserving spatial continuity. Specifically, PanoExpand takes an omnidirectional image as input, converts it into a 1D sequence via spherical unfolding, then iteratively refines the sequence through learning and training, and finally reconstructs it into a 2D image to obtain the super-resolved panoramic output. Experimental results demonstrate the superiority of PanoExpand in omnidirectional image super-resolution quality.
This paper aims to present a novel methodology that merges image stitching with super-resolution techniques, enabling the creation of a high-resolution panoramic image from several low-resolution inputs. The proposed approach comprehensively addresses challenges throughout the process, encompassing image preprocessing, alignment and handling of mismatches, stitching, super-resolution reconstruction, and post-processing. Employing advanced methodologies such as Convolutional Neural Networks (CNNs), Scale-Invariant Feature Transform (SIFT), Random Sample Consensus (RANSAC), the GrabCut algorithm, the Super-Resolution Convolutional Neural Network (SRCNN), gradient domain optimization, and the Structural Similarity Index Measure (SSIM), each step meticulously tackles specific issues inherent to image stitching tasks. A key innovation lies in the synergy of image stitching and super-resolution techniques, yielding a solution that boasts high robustness and efficiency. This versatile method is adaptable to diverse image processing contexts. To validate its effectiveness, experiments were conducted on two established datasets, USIS-D and VGG, where a quartet of quantitative metrics (Peak Signal-to-Noise Ratio (PSNR), SSIM, Entropy (EN), and QABF) was employed to gauge the quality of stitched images against alternative methods. The outcomes decisively illustrate the superiority of our proposed method, achieving superior performance across all metrics and producing panoramas devoid of seams and distortions. This work thereby contributes a significant advancement in the realm of high-fidelity panoramic image reconstruction.
Stereoscopic omnidirectional images (SODIs) usually require recording very high-resolution (HR) information whereby it is beneficial to exploit a super-resolution (SR) scheme to super-resolve low-resolution (LR) SODIs. Compared with traditional 2-D SR approaches, the algorithms of SODI SR (SODI-SR) need to deal with two extra aspects: binocular information and panoramic characteristics. In this article, we first build a synthetic-specific SODI-SR dataset with LR-HR image pairs. Then, we propose a dynamic convolutions (Dconvs) and transformer network (Dconv-Trans-Net) for SODI-SR. Specifically, due to the nonuniform sampling of equirectangular projection (ERP), we deploy Dconvs with the structure of atrous spatial pyramid pooling (ASPP) to adaptively select content-aware and weight-aware kernels for patch-wise feature extraction. To capture diverse feature embeddings from left and right views, we utilize a symmetric bidirectional parallax attention module (biPAM) to extract local features along epipolar lines and propose cross-view transformer (CvTrans) to mine global contextual features apart from the epipolar line. Finally, quantitative and qualitative experiments demonstrate that our proposed approach outperforms the state-of-the-art (SOTA) panoramic or stereoscopic SR methods on the constructed SODI-SR dataset with two upscaling factors.
With the development of virtual reality headsets, panoramic videos have started to gain increased popularity, and 360-degree video streaming technologies and devices have achieved significant success in the market. However, the video resolution provided by existing mainstream panoramic video cameras is not high enough to give users a viewing experience close to that of a conventional display, and the network bandwidth is another bottleneck for high-resolution 360-degree video streaming. Researchers have introduced numerous approaches to resolve these issues, including viewport prediction and regional super-resolution. Nevertheless, existing methods cannot ensure that users always have high-resolution viewport content because of the inevitable prediction error and the heavy server-side pre-processing consumption. In this paper, we present RA360SR, a real-time acceleration-adaptive 360-degree video super-resolution system. We develop a dual-camera system with Unity3D post-processing to implement the real-time super-resolution model processing. Additionally, to obtain a more stable frame rate, we propose an acceleration-adaptive approach to switch the super-resolution model processing status based on the acceleration of users' head movements. Our results show that RA360SR can deliver a sharper video viewing experience for users while providing an acceptable frame rate.
High-resolution images of the lunar surface are generally used to study the lunar soil and terrain. Nonetheless, acquiring higher-resolution images involves greater memory and computing power, which is a challenge for lunar landers or rovers. In this study, a deep convolutional neural network single-image super-resolution reconstruction method based on channel-spatial attention is proposed to achieve the mapping of low-resolution lunar surface images to high-resolution lunar surface images. An enhanced block named the feature fusion block is used for feature extraction. Furthermore, a channel-spatial attention module, including an efficient channel attention module and an enhanced spatial attention module, can extract more discriminative channel features and critical spatial features. Finally, the model utilizes a local implicit image function to predict the RGB values of the images. Images from the lander terrain camera and rover panoramic camera carried by Chang'e 3 and Chang'e 4 are used for training and validating the models, and the Apollo rover images are used for testing. The experiment demonstrates the superiority of the proposed novel model over the comparative method using images of the Apollo project as the test dataset. Compared with traditional methods, the PSNR value of the proposed method is improved by about 0.26 dB in the 4x super-resolution reconstruction experiment.
The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that the generated video samples from Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection tasks on the nuScenes and Argoverse 2 datasets. These results strongly prove Panacea+ to be a valuable data generation framework for autonomous driving.
Estimating room layouts from 360° panoramic images is an essential task in computer vision, enhancing 3D scene understanding from 2D images. Achieving and handling high-quality panoramas necessitates a Super-Resolution (SR) technique that is both lightweight and capable of arbitrary-scale SR, optimized for efficient inference. In this study, we propose a simple model that employs a modular arbitrary-scale technique. Additionally, our model incorporates quantization to maximize efficiency during user inference, making it well-suited for processing high-quality panoramic images.
The Panoramic Road Scene Generation (PRSG) pipeline is a framework designed to generate realistic and contextually accurate panoramic road scenes from textual descriptions. This approach leverages a combination of a stable diffusion model and super-resolution techniques to produce panoramic images for autonomous vehicle (AV) simulations and VR/AR applications. By generating detailed panoramic road scenes viewable in a VR HMD, an additional avenue for easily creating and simulating vehicular scenarios becomes possible. The pipeline utilizes a customized dataset with specifically tailored captions to fine-tune the generation process, ensuring both visual fidelity and contextual relevance. The evaluation framework includes both subjective and objective metrics to assess the quality and applicability of the generated images. A pilot study determined the generated road scenes to be relatively realistic and acceptable for viewing. Potential applications include the creation of new synthetic scenes for simulating AV scenarios, or training and validation of autonomous systems.
Panoramic images, with their wide field of view and abundant information, have become essential visual materials in digital art creation and virtual reality. However, existing panoramic image restoration and quality enhancement methods lack high-level semantic understanding and global feature control, which often leads to structural disorder in complex scenes. They also struggle to balance semantic comprehension, real-time performance, and restoration quality. To address these issues, this paper proposes a panoramic image restoration and visual quality enhancement model for digital art creation. The model uses a Multi-Scale Residual Network, a Coordinate Space Attention mechanism, and super-resolution reconstruction to construct a visual quality enhancement algorithm, which accurately captures both local details and global structural features. Based on this algorithm, an optimized Generative Adversarial Network and a Vision Transformer are integrated to model the spatial correlation and semantic logic between damaged and undamaged regions, achieving high-quality completion. Experimental results show that the model achieves Structural Similarity Index values of 0.975 and 0.971, Peak Signal-to-Noise Ratios of 53.82 dB and 53.75 dB, a maximum memory usage of 394 MB, and a response time of 3.12 s with a data volume of 2000 on the DIV2K and SUN360 datasets. The model outperforms the comparison models in all metrics, enhances both detail clarity and global consistency, and maintains efficient processing performance. It provides high-quality visual materials for digital art creation and shows significant advantages across various performance indicators.
High-quality panoramic images with a Field of View (FoV) of 360° are essential for contemporary panoramic computer vision tasks. However, conventional imaging systems come with sophisticated lens designs and heavy optical components. This disqualifies their usage in many mobile and wearable applications where thin and portable, minimalist imaging systems are desired. In this paper, we propose a Panoramic Computational Imaging Engine (PCIE) to achieve minimalist and high-quality panoramic imaging. With less than three spherical lenses, a Minimalist Panoramic Imaging Prototype (MPIP) is constructed based on the design of the Panoramic Annular Lens (PAL), but with low-quality imaging results due to aberrations and small image plane size. We propose two pipelines, i.e. Aberration Correction (AC) and Super-Resolution and Aberration Correction (SR&AC), to solve the image quality problems of MPIP, with imaging sensors of small and large pixel size, respectively. To leverage the prior information of the optical system, we propose a Point Spread Function (PSF) representation method to produce a PSF map as an additional modality. A PSF-aware Aberration-image Recovery Transformer (PART) is designed as a universal network for the two pipelines, in which the self-attention calculation and feature extraction are guided by the PSF map. We train PART on synthetic image pairs from simulation and put forward the PALHQ dataset to fill the gap of real-world high-quality PAL images for low-level vision. A comprehensive variety of experiments on synthetic and real-world benchmarks demonstrates the impressive imaging results of PCIE and the effectiveness of the PSF representation. We further deliver heuristic experimental findings for minimalist and high-quality panoramic imaging, in terms of the choices of prototype and pipeline, network architecture, training strategies, and dataset construction. Our dataset and code will be available at https://github.com/zju-jiangqi/PCIE-PART.
In the current era of generative AI breakthroughs, generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However, the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g., multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC, a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360° panoramic scenes. L-MAGIC harnesses pretrained diffusion and language models without fine-tuning, ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works, with >70% preference in human evaluations. Combined with conditional diffusion models, L-MAGIC can accept various input modalities, including but not limited to text, depth maps, sketches, and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. Code is available at https://github.com/ZhipengCai/L-MAGIC-code-release.
We discuss virtual staging technologies, focusing on two primary pipelines for creating and exploring immersive indoor environments in the metaverse: an AI-based image processing pipeline and a LIDAR-based pipeline. The AI-based image processing pipeline leverages advanced AI algorithms for tasks such as clutter removal, semantic style transfer, and super-resolution, enabling rapid generation of high-quality, photorealistic virtual environments from single panoramic images. The LIDAR-based pipeline captures measurable 3D models of indoor spaces, facilitating immersive editing and collaborative design through real-time interaction with high-fidelity virtual environments. A qualitative comparative analysis of these technologies highlights their strengths and limitations in various applications. The practical implications of these pipelines are discussed, particularly their potential to transform industries such as real estate, furniture retail, interior design, construction, remote collaboration, and immersive training. The paper concludes with suggestions for future research, including conducting user studies, integrating the two pipelines, and optimizing technologies for mobile and edge devices to enhance accessibility and usability.
The technologies of virtual reality (VR) and augmented reality (AR) have developed fast in recent years and have been applied in various scenarios, including the field of education. Educators now attempt to make use of these new technologies to create an immersive learning environment (ILE) for students, in the hope of helping them improve learning efficiency and quality. Currently, available research on the design and optimization of ILEs focuses on how to introduce teaching content into the 3D environment and improve user experience, while a series of problems still await solutions, such as how to arrange the layout in multi-dimensional space, how to optimize the vision in an immersive environment based on an edge strategy, and how to improve image quality using panoramic-view specific-region super-resolution (PSS) algorithms that use non-local feature fusion. In view of these matters, this study proposed a novel method for the design and optimization of an ILE. First, problems of how to create a visual layout of network topology in the ILE, how to arrange the layout based on the advantages of different-dimensional spaces in scenes, and how to carry out visual optimization combined with the edge strategy were discussed in detail. Then, a new PSS algorithm that utilizes non-local feature fusion was proposed, and the functioning of the multi-scale non-local feature fusion module and the specific structure of the high-resolution network were introduced in detail. The findings of this study could provide useful theoretical evidence and practical guidance for the design and optimization of ILEs, giving new possibilities for improving the effect of ILEs and enhancing user experience.
This paper utilizes VR technology to construct a virtual sports training situation and proposes a sports training feedback teaching mode based on VR technology. For VR high-resolution scene generation, it analyzes multi-frame image super-resolution recovery algorithms, introduces a regularization solution method, and, on this basis, proposes a generalized-entropy variational image super-resolution recovery algorithm to achieve better VR panoramic visual teaching. Taking Taekwondo movement training at K school as the research subject, we compared the change in learning attitude, the influence on special physical qualities, and the ability to master Taekwondo techniques between the experimental group and the control group. The evaluation results found that the technical score of movement completion ability after the experiment was 94.98±2.36 in the experimental group and 85.79±5.32 in the control group. The quality of physical education teaching can be improved by using VR technology in the teaching mode.
360° images have wide applications in fields such as virtual reality and user experience design. Our goal is to adjust these images to guide users' visual attention. To achieve this, we present a novel task: target scanpath-guided 360° image enhancement, which aims to enhance 360° images based on user-specified target scanpaths. We develop a Progressive Scanpath-Guided Enhancement Method (PSEM) to address this problem through three stages. In the first stage, we propose a Time-Alignment and Spatial Similarity Clustering (TASSC) algorithm that accounts for the spherical nature of 360° images and the temporal dependency of scanpaths to generate representative scanpaths. In the second stage, we learn the differences between the source and the target scanpaths and select the objects to be edited based on these differences. Particularly, we propose a Dual-Stream Scanpath Difference Encoder (DSDE) embedded into the Segment Anything Model (SAM) network for object mask generation. Finally, we employ a Stable Diffusion network fine-tuned with LoRA technology to produce the final enhanced image. Additionally, we design special loss functions to supervise the training of the second and third stages. Experimental results have demonstrated the effectiveness of our approach for scanpath-guided 360° image enhancement.
Omnidirectional (360-degree) video is rapidly gaining popularity due to advancements in immersive technologies like virtual reality (VR) and extended reality (XR). However, real-time streaming of such videos, especially in live mobile scenarios like unmanned aerial vehicles (UAVs), is challenged by limited bandwidth and strict latency constraints. Traditional methods, such as compression and adaptive resolution, help but often compromise video quality and introduce artifacts that degrade the viewer experience. Additionally, the unique spherical geometry of 360-degree video presents challenges not encountered in traditional 2D video. To address these issues, we initiated the 360-degree Video Super Resolution and Quality Enhancement Challenge. This competition encourages participants to develop efficient machine learning solutions to enhance the quality of low-bitrate compressed 360-degree videos, with two tracks focusing on 2x and 4x super-resolution (SR). In this paper, we outline the challenge framework, detailing the two competition tracks and highlighting the SR solutions proposed by the top-performing models. We assess these models within a unified framework, considering quality enhancement, bitrate gain, and computational efficiency. This challenge aims to drive innovation in real-time 360-degree video streaming, improving the quality and accessibility of immersive visual experiences.
We present a simple VR-specific image detail enhancement method that improves the viewing experience of 360-degree stereoscopic photographed VR contents. By exploiting the fusion characteristics of binocular vision, we propose an asymmetric process that applies detail enhancement to one single image channel only. Our method can dynamically apply the enhancement in a view-adaptive fashion in real-time on most low-cost standalone VR headsets. We discuss the benefits of this method with respect to authoring possibilities, storage and bandwidth issues of photographed VR contents.
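As a minimal, hypothetical sketch of the asymmetric idea described above (sharpen one eye's view only and rely on binocular fusion), the following applies unsharp masking to the left view and leaves the right view untouched; all function and parameter names are ours, not the paper's:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask(img: np.ndarray, sigma: float = 2.0, amount: float = 0.8) -> np.ndarray:
    """Classic unsharp masking on a float image in [0, 1]."""
    blurred = gaussian_filter(img, sigma=(sigma, sigma, 0))  # blur spatial dims only
    return np.clip(img + amount * (img - blurred), 0.0, 1.0)

def enhance_stereo_pair(left: np.ndarray, right: np.ndarray):
    """Asymmetric processing: detail enhancement on the left view only."""
    return unsharp_mask(left), right  # the right view is passed through unmodified

# Toy usage with random data standing in for the two eye views.
left = np.random.rand(256, 512, 3).astype(np.float32)
right = np.random.rand(256, 512, 3).astype(np.float32)
left_enhanced, right_original = enhance_stereo_pair(left, right)
```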
360-degree video streaming is becoming increasingly popular for its immersive experience. Traditional adaptive tile-based streaming methods allocate bitrates according to viewport prediction, which effectively reduces the required transmission bandwidth but causes serious quality degradation when the viewport prediction is inaccurate. Thus, some researchers have proposed visual reconstruction and enhancement-based 360-degree video streaming frameworks, which can reconstruct the whole frame at very low bitrates. However, existing frameworks are built upon image-based visual reconstruction methods, which do not fully consider the characteristics of videos. In this paper, we propose a masked autoencoder-based, multi-scale optimized framework for 360-degree video streaming (MAESR360), which fully considers the temporal relevance of the video. We utilize spatio-temporal downsampling and high-ratio tube masking strategies to effectively reduce the amount of transmitted data. Additionally, we design a lightweight visual reconstruction model based on multi-scale feature fusion to recover the visual quality of video frames. The effectiveness of our proposed method is demonstrated through extensive experiments.
The huge amount of data necessary to capture the full field-of-view (FoV) in omnidirectional (360°) video imposes the use of highly efficient compressed formats as well as adaptive broadcast and streaming mechanisms, such as those foreseen for 5G networks. To cope with the demanding requirements of 360° video streaming over 5G networks, this work proposes a scalable 360° video coding architecture, enabling adaptation through the Multi-Access Edge Computing (MEC) server in two different domains of the spherical visual content, namely spatial resolution and FoV. In the proposed architecture, two layers are encoded from the input 360° video content: (i) the base layer (BL), encoding each 360° image as a whole at a lower spatial resolution; (ii) the enhancement layer (EL), encoding each spherical image as a set of multiple FoVs at higher spatial resolution. Such an arrangement enables flexible stream adaptation by the smart decision algorithms implemented at the MEC server, allowing a significant reduction of the overall bit rate over the radio interface. The simulation results show that the proposed scalable coding scheme allows a great deal of bit rate savings across the 5G network, achieving 36% bit rate savings on average for a 90° FoV in comparison with conventional single-layer coding.
We investigate a novel multi-user mobile Virtual Reality (VR) arcade system for streaming scalable 8K 360° video with low interactive latency, while providing high remote scene immersion fidelity and application reliability. This is achieved through the integration of embedded multi-layer 360° tiling, edge computing, and wireless multi-connectivity that comprises sub-6 GHz and mmWave (millimeter wave) links. The sub-6 GHz band is used for broadcast of the base layer of the entire 360° panorama to all users, while the directed mmWave links are used for high-rate transmission of VR-enhancement layers that are specific to the viewports of the individual users. The viewport-specific enhancements can comprise compressed and raw 360° tiles, decoded first at the edge server. We aim to maximize the smallest immersion fidelity for the delivered 360 content across all VR users, given rate, latency and computing constraints. We characterize analytically the rate-distortion trade-offs across the spatiotemporal 360° panorama and the computing power required to decompress 360° tiles. The proposed solution consists of geometric programming algorithms and an intermediate step of graph-theoretic VR user to mmWave access point assignment. The results reveal a significant improvement (8–10 dB) in delivered VR user immersion fidelity and spatial resolution (8K vs. 4K) compared to a state-of-the-art method based on sub-6 GHz transmission only. We also show that an increasing number of raw 360° tiles are sent, as the mmWave network link data rate or the edge server/user computing power increase. Finally, we demonstrate that in order to hypothetically deliver the same immersion fidelity, the reference method would incur a much higher (2.5-4.5x) system latency.
Glossy reflections of the surroundings play a major role when trying to achieve a seamless fusion of real and virtual objects in Mixed Reality (MR) environments. Traditionally, the necessary information about the ambiance is captured using mirrored balls, HDR cameras, fish-eye lenses, RGB-D cameras or 360-degree cameras. While these approaches allow for pretty good results, they require a rather complex setup. Our approach is based on a single RGB camera capturing the environmental lighting at a certain location within the scene. Therefore, we apply a precomputation step generating a 360-degree environment map and combine it with a camera-based image stitching for a continuous enhancement and update of the lighting information. We show that our approach allows for realistic and high-quality reflections within an AR/MR environment in real time even on mobile devices.
In this paper, we tackle the recently popular topic of generating 360-degree images given the conventional narrow field of view (NFoV) images that could be taken from a single camera or cellphone. This task aims to predict the reasonable and consistent surroundings from the NFoV images. Existing methods for feature extraction and fusion, often built with transformer-based architectures, incur substantial memory usage and computational expense. They also have limitations in maintaining visual continuity across the entire 360-degree images, which could cause inconsistent texture and style generation. To solve the aforementioned issues, we propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba to utilize its long-sequence modelling and spatial continuity. Furthermore, incorporating textual information is an effective strategy for guiding image generation, enriching the process with detailed context and increasing diversity. Efficiently extracting textual features and integrating them with image attributes presents a significant challenge for 360-degree image out-painting. To address this, we develop two modules, Visual-textual Consistency Refiner (VCR) and Global-local Mamba Adapter (GMA). VCR enhances contextual richness by fusing the modified text features with the image features, while GMA provides adaptive state-selective conditions by capturing the information flow from global to local representations. Our proposed method achieves state-of-the-art performance with extensive experiments on two broadly used 360-degree image datasets, including indoor and outdoor settings.
In virtual reality (VR) applications, 360-degree images play a pivotal role in crafting immersive experiences and offering panoramic views, thus enhancing the visual experience of the user. However, the voluminous data generated by 360-degree images poses challenges for network storage and bandwidth. To address these challenges, we propose a novel Activation Map-based Vector Quantization (AM-VQ) framework, which is designed to reduce communication overhead for wireless transmission. The proposed AM-VQ scheme uses Deep Neural Networks (DNNs) with vector quantization (VQ) to extract and compress semantic features. In particular, the AM-VQ framework utilizes an activation map to adaptively quantize semantic features, thereby reducing the data distortion caused by quantization. To further enhance the reconstruction quality of the 360-degree image, adversarial training with a Generative Adversarial Network (GAN) discriminator is incorporated. Numerical results show that our proposed AM-VQ scheme achieves better performance than existing Deep Learning (DL)-based coding and traditional coding schemes under the same number of transmission symbols.
Omnidirectional images (ODIs) possess unique equirectangular projection geometric properties that pose challenges for traditional super-resolution methods. Existing ODI super-resolution (ODISR) models often struggle to effectively capture both spatial and frequency domain information, limiting their performance. We introduce a lightweight model that fuses information from both domains to enhance ODISR. A dual-domain attention mechanism tailored for ODISR is proposed, incorporating a reparameterized pixel attention module and a frequency-domain attention module. This approach achieves a balance between efficiency and reconstruction quality, reducing both time and space complexity. Experiments demonstrate that our model outperforms state-of-the-art lightweight ODISR models while maintaining competitive performance.
Omnidirectional Videos (or 360° videos) are widely used in Virtual Reality (VR) to facilitate immersive and interactive viewing experiences. However, the limited spatial resolution in 360° videos does not allow for each degree of view to be represented with adequate pixels, limiting the visual quality offered in the immersive experience. Deep learning Video Super-Resolution (VSR) techniques used for conventional videos could provide a promising software-based solution; however, these techniques do not tackle the distortion present in equirectangular projections of 360° video signals. An additional obstacle is the limited 360° video datasets to study. To address these issues, this paper creates a novel 360° Video Dataset (360VDS) with a study of the extensibility of conventional VSR models to 360° videos. This paper further proposes a novel deep learning model for 360° Video Super-Resolution (360° VSR), called Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO). S3PO adopts recurrent modelling with an attention mechanism, unbound from conventional VSR techniques like alignment. With a purpose-built feature extractor and a novel loss-function addressing spherical distortion, S3PO outperforms most state-of-the-art conventional VSR models and 360° specific super-resolution models on 360° video datasets. A step-wise ablation study is presented to understand and demonstrate the impact of the chosen architectural sub-components, targeted training and optimisation.
In the context of Omni-Directional Image (ODI) Super-Resolution (SR), the unique challenge arises from the non-uniform oversampling characteristics caused by EquiRectangular Projection (ERP). Considerable efforts in designing complex spherical convolutions or polyhedron reprojection offer significant performance improvements but at the expense of cumbersome processing procedures and slower inference speeds. Under these circumstances, this paper proposes a new ODI-SR model characterized by its capacity to perform Fast and Arbitrary-scale ODI-SR processes, denoted as FAOR. The key innovation lies in adapting the implicit image function from the planar image domain to the ERP image domain by incorporating spherical geometric priors at both the latent representation and image reconstruction stages, in a low-overhead manner. Specifically, at the latent representation stage, we adopt a pair of pixel-wise and semantic-wise sphere-to-planar distortion maps to perform affine transformations on the latent representation, thereby incorporating it with spherical properties. Moreover, during the image reconstruction stage, we introduce a geodesic-based resampling strategy, aligning the implicit image function with spherical geometrics without introducing additional parameters. As a result, the proposed FAOR outperforms the state-of-the-art ODI-SR models with a much faster inference speed. Extensive experimental results and ablation studies have demonstrated the effectiveness of our design.
No abstract available
As augmented reality and virtual reality applications gain popularity, image processing for OmniDirectional Images (ODIs) has attracted increasing attention. OmniDirectional Image Super-Resolution (ODISR) is a promising technique for enhancing the visual quality of ODIs. Before performing super-resolution, ODIs are typically projected from a spherical surface onto a plane using EquiRectangular Projection (ERP). This projection introduces latitude-dependent geometric distortion in ERP images: distortion is minimal near the equator but becomes severe toward the poles, where image content is stretched across a wider area. However, existing ODISR methods have limited sampling ranges and feature extraction capabilities, which hinder their ability to capture distorted patterns over large areas. To address this issue, we propose a novel Multi-level Distortion-aware Deformable Network (MDDN) for ODISR, designed to expand the sampling range and receptive field. Specifically, the feature extractor in MDDN comprises three parallel branches: a deformable attention mechanism (serving as the dilation=1 path) and two dilated deformable convolutions with dilation rates of 2 and 3. This architecture expands the sampling range to include more distorted patterns across wider areas, generating dense and comprehensive features that effectively capture geometric distortions in ERP images. The representations extracted from these deformable feature extractors are adaptively fused in a multi-level feature fusion module. Furthermore, to reduce computational cost, a low-rank decomposition strategy is applied to dilated deformable convolutions. Extensive experiments on publicly available datasets demonstrate that MDDN outperforms state-of-the-art methods, underscoring its effectiveness and superiority in ODISR.
With the rapid development of virtual reality, omnidirectional images (ODIs) have attracted much attention from both the industrial community and academia. However, due to storage and transmission limitations, the resolution of current ODIs is often insufficient to provide an immersive virtual reality experience. Previous approaches address this issue using conventional 2D super-resolution techniques on equirectangular projection without exploiting the unique geometric properties of ODIs. In particular, the equirectangular projection (ERP) provides a complete field-of-view but introduces significant distortion, while the cubemap projection (CMP) can reduce distortion yet has a limited field-of-view. In this paper, we present a novel Bi-Projection Omnidirectional Image Super-Resolution (BPOSR) network to take advantage of the geometric properties of the above two projections. Then, we design two tailored attention methods for these projections: Horizontal Striped Transformer Block (HSTB) for ERP and Perspective Shift Transformer Block (PSTB) for CMP. Furthermore, we propose a fusion module to make these projections complement each other. Extensive experiments demonstrate that BPOSR achieves state-of-the-art performance on omnidirectional image super-resolution. The code is available at https://github.com/W-JG/BPOSR.
Omnidirectional images (ODIs) are commonly used in real-world visual tasks, and high-resolution ODIs help improve the performance of related visual tasks. Most existing super-resolution methods for ODIs use end-to-end learning strategies, resulting in inferior realness of generated images and a lack of effective out-of-domain generalization capabilities in training methods. Image generation methods represented by diffusion model provide strong priors for visual tasks and have been proven to be effectively applied to image restoration tasks. Leveraging the image priors of the Stable Diffusion (SD) model, we achieve omnidirectional image super-resolution with both fidelity and realness, dubbed as OmniSSR. Firstly, we transform the equirectangular projection (ERP) images into tangent projection (TP) images, whose distribution approximates the planar image domain. Then, we use SD to iteratively sample initial high-resolution results. At each denoising iteration, we further correct and update the initial results using the proposed Octadecaplex Tangent Information Interaction (OTII) and Gradient Decomposition (GD) technique to ensure better consistency. Finally, the TP images are transformed back to obtain the final high-resolution results. Our method is zero-shot, requiring no training or fine-tuning. Experiments of our method on two benchmark datasets demonstrate the effectiveness of our proposed method.
As virtual and augmented reality applications gain popularity, omnidirectional image (ODI) super-resolution has become increasingly important. Unlike 2D plain images that are formed on a plane, ODIs are projected onto spherical surfaces. Applying established image super-resolution methods to ODIs, therefore, requires performing equirectangular projection (ERP) to map the ODIs onto a plane. ODI super-resolution needs to take into account geometric distortion resulting from ERP. However, without considering such geometric distortion of ERP images, previous methods only utilize a limited range of pixels and may easily miss self-similar textures for reconstruction. In this paper, we introduce a novel Geometric Distortion Guided Transformer for Omnidirectional image Super-Resolution (GDGT-OSR). Specifically, a distortion modulated rectangle-window self-attention mechanism, integrated with deformable self-attention, is proposed to better perceive the distortion and thus involve more self-similar textures. Distortion modulation is achieved through a newly devised distortion guidance generator that produces guidance for the rectangular windows by exploiting the variability of distortion across latitudes. Furthermore, we propose a dynamic feature aggregation scheme to adaptively fuse the features from different self-attention modules. We present extensive experimental results on public datasets and show that the new GDGT-OSR outperforms methods in existing literature.
Omnidirectional images have attracted significant attention in recent years due to the rapid development of virtual reality technologies. Equirectangular projection (ERP), a naive form to store and transfer omnidirectional images, however, is challenging for existing two-dimensional (2D) image super-resolution (SR) methods due to its inhomogeneous distributed sampling density and distortion across latitude. In this paper, we make one of the first attempts to design a spherical pseudo-cylindrical representation, which not only allows pixels at different latitudes to adaptively adopt the best distinct sampling density but also is model-agnostic to most off-the-shelf SR methods, enhancing their performances. Specifically, we start by upsampling each latitude of the input ERP image and design a computationally tractable optimization algorithm to adaptively obtain a (sub)-optimal sampling density for each latitude of the ERP image. Addressing the distortion of ERP, we introduce a new viewport-based training loss based on the original 3D sphere format of the omnidirectional image, which inherently lacks distortion. Finally, we present a simple yet effective recursive progressive omnidirectional SR network to showcase the feasibility of our idea. The experimental results on public datasets demonstrate the effectiveness of the proposed method as well as the consistently superior performance of our method over most state-of-the-art methods both quantitatively and qualitatively.
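The latitude-adaptive sampling idea above can be illustrated with a crude pseudo-cylindrical resampling in which each ERP row is resampled to a width proportional to cos(latitude); this is only an illustration of the representation, since the paper obtains the per-latitude density by optimization rather than by a fixed cosine rule:

```python
import numpy as np

def pseudo_cylindrical_rows(erp: np.ndarray, min_width: int = 8):
    """Resample each latitude row of an (H, W, C) ERP image to a width ~ cos(latitude)."""
    h, w, _ = erp.shape
    rows = []
    for i in range(h):
        lat = (i + 0.5) / h * np.pi - np.pi / 2             # latitude in [-pi/2, pi/2]
        new_w = max(min_width, int(round(w * np.cos(lat))))
        x_src = np.linspace(0, w - 1, new_w)                # source sample positions
        x0 = np.floor(x_src).astype(int)
        x1 = np.minimum(x0 + 1, w - 1)
        frac = (x_src - x0)[:, None]
        rows.append((1 - frac) * erp[i, x0] + frac * erp[i, x1])  # linear interpolation
    return rows

rows = pseudo_cylindrical_rows(np.random.rand(64, 128, 3))
```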
Benefiting from the 360° field of view (FoV) of the omnidirectional images (ODIs), users could enjoy an immersive experience with head-mounted devices or computers. High-resolution ODIs can provide pleasing visual experience and boost the performance of related visual tasks. Therefore, Super-resolution (SR) is an essential technique during the application of ODIs. However, traditional SR methods fail to enhance the most widely utilized equirectangular projection (ERP) format ODIs due to projection distortions. Existing ODI-SR methods take the latitude-related position information as a prior, but lack the adaptation to the ERP content distribution characteristics. To address this issue, we propose a novel Frequency Attention Transformer ODI-SR (FATO) network focusing on high-frequency details of ODIs. In particular, we transform an ODI into fine-grained patches in the frequency domain through Discrete Cosine Transform (DCT). After that, we design a frequency self-attention mechanism to capture the relationship between different frequency patches. Subsequently, we introduce a frequency loss function to further constrain the network. Extensive experimental results demonstrate that the proposed FATO achieves superior performance over state-of-the-art methods on ODIs.
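A frequency-domain loss of the kind referred to above can be sketched as a blockwise 2D DCT comparison between the super-resolved and ground-truth images; the block size and the use of an L1 distance are our assumptions, not the paper's exact design:

```python
import torch
import torch.nn.functional as F

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II matrix of size n x n."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(n, dtype=torch.float32).unsqueeze(0)
    m = torch.cos(torch.pi * (i + 0.5) * k / n) * (2.0 / n) ** 0.5
    m[0] = m[0] / (2.0 ** 0.5)
    return m

def blockwise_dct_loss(sr: torch.Tensor, hr: torch.Tensor, block: int = 8) -> torch.Tensor:
    """L1 distance between blockwise 2D DCT coefficients of SR and HR images (B, C, H, W)."""
    d = dct_matrix(block).to(sr)
    def dct2(x):
        x = x.unfold(2, block, block).unfold(3, block, block)  # non-overlapping blocks
        return d @ x @ d.T                                     # 2D DCT on each block
    return F.l1_loss(dct2(sr), dct2(hr))

loss = blockwise_dct_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```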
For convenient transmission, omnidirectional images (ODIs) usually follow the equirectangular projection (ERP) format and are low-resolution. To provide a better immersive experience, omnidirectional image super-resolution (ODISR) is essential. However, ERP ODIs suffer from serious geometric distortion and pixel stretching across latitudes, generating massive redundant information at high latitudes. This characteristic poses a huge challenge for traditional SR methods, which can only obtain suboptimal ODISR performance. To address this issue, we propose a novel position attention network (PAN) for ODISR in this paper. Specifically, a two-branch structure is introduced, in which the basic enhancement branch (BE) serves to achieve coarse deep feature enhancement for the extracted shallow features. Meanwhile, the position attention enhancement branch (PAE) builds a positional attention mechanism to dynamically adjust the contribution of features at different latitudes in the ERP representation according to their positions and stretching degrees, which enhances the differentiated information, suppresses the redundant information, and modulates the deep features according to the spatial distortion. Subsequently, the features of the two branches are fused effectively to achieve further refinement and adapt to the distortion characteristics of ODIs. After that, we exploit a long-term memory module (LM), promoting information interactions and fusions between the branches to enhance the perception of the distortion, aggregating prior hierarchical features to keep long-term memory and boosting the ODISR performance. Extensive results demonstrate the state-of-the-art performance and high efficiency of our PAN in ODISR.
Omnidirectional images (ODIs) have obtained lots of research interest for immersive experiences. Although ODIs require extremely high resolution to capture details of the entire scene, the resolutions of most ODIs are insufficient. Previous methods attempt to solve this issue by image super-resolution (SR) on equirectangular projection (ERP) images. However, they omit geometric properties of ERP in the degradation process, and their models can hardly generalize to real ERP images. In this paper, we propose Fisheye downsampling, which mimics the real-world imaging process and synthesizes more realistic low-resolution samples. Then we design a distortion-aware Transformer (OSRT) to modulate ERP distortions continuously and self-adaptively. Without a cumbersome process, OSRT outperforms previous methods by about 0.2dB on PSNR. Moreover, we propose a convenient data augmentation strategy, which synthesizes pseudo ERP images from plain images. This simple strategy can alleviate the over-fitting problem of large networks and significantly boost the performance of ODISR. Extensive experiments have demonstrated the state-of-the-art performance of our OSRT.
The $360^{\circ}$ imaging has recently gained much attention; however, its angular resolution is relatively lower than that of a narrow field-of-view (FOV) perspective image as it is captured using a fisheye lens with the same sensor size. Therefore, it is beneficial to super-resolve a $360^{\circ}$ image. Several attempts have been made, but mostly considered equirectangular projection (ERP) as one of the ways for $360^{\circ}$ image representation despite the latitude-dependent distortions. In that case, as the output high-resolution (HR) image is always in the same ERP format as the low-resolution (LR) input, additional information loss may occur when transforming the HR image to other projection types. In this paper, we propose SphereSR, a novel framework to generate a continuous spherical image representation from an LR $360^{\circ}$ image, with the goal of predicting the RGB values at given spherical coordinates for super-resolution with an arbitrary $360^{\circ}$ image projection. Specifically, first we propose a feature extraction module that represents the spherical data based on an icosahedron and that efficiently extracts features on the spherical surface. We then propose a spherical local implicit image function (SLIIF) to predict RGB values at the spherical coordinates. As such, SphereSR flexibly reconstructs an HR image given an arbitrary projection type. Experiments on various benchmark datasets show that the proposed method significantly surpasses existing methods in terms of performance.
No abstract available
No abstract available
We propose deep convolutional neural network (CNN) based super-resolution for 360° (equirectangular) panorama images used by virtual reality (VR) display devices (e.g., VR glasses). The proposed super-resolution adopts the recent CNN architecture proposed in (Dong et al., 2016) and adapts it for equirectangular panorama images, which have specific characteristics compared to standard camera images (e.g., projection distortions). We demonstrate how the adaptation can be performed by optimizing the trained network input size and fine-tuning the network parameters. In our experiments with 360° panorama images of rich natural content, CNN-based super-resolution achieves an average PSNR improvement of 1.36 dB over the baseline (bicubic interpolation) and 1.56 dB with our equirectangular-specific adaptation.
In this paper, we study hybrid neural representations for spherical data, a domain of increasing relevance in scientific research. In particular, our work focuses on weather and climate data as well as cosmic microwave background (CMB) data. Although previous studies have delved into coordinate-based neural representations for spherical signals, they often fail to capture the intricate details of highly nonlinear signals. To address this limitation, we introduce a novel approach named Hybrid Neural Representations for Spherical data (HNeR-S). Our main idea is to use spherical feature-grids to obtain positional features, which are combined with a multilayer perceptron to predict the target signal. We consider feature-grids with equirectangular and hierarchical equal area isolatitude pixelization structures that align with weather data and CMB data, respectively. We extensively verify the effectiveness of our HNeR-S for regression, super-resolution, temporal interpolation, and compression tasks.
Image warping aims to reshape images defined on rectangular grids into arbitrary shapes. Recently, implicit neural functions have shown remarkable performances in representing images in a continuous manner. However, a standalone multi-layer perceptron suffers from learning high-frequency Fourier coefficients. In this paper, we propose a local texture estimator for image warping (LTEW) followed by an implicit neural representation to deform images into continuous shapes. Local textures estimated from a deep super-resolution (SR) backbone are multiplied by locally-varying Jacobian matrices of a coordinate transformation to predict Fourier responses of a warped image. Our LTEW-based neural function outperforms existing warping methods for asymmetric-scale SR and homography transform. Furthermore, our algorithm well generalizes arbitrary coordinate transformations, such as homography transform with a large magnification factor and equirectangular projection (ERP) perspective transform, which are not provided in training.
Virtual reality (VR) can provide users with an immersive and realistic visual experience, which has led to the wide use of VR in many fields. However, transmission of ultrahigh-resolution omnidirectional video requires huge bandwidth, which brings great challenges for real-time VR applications. In this paper, we propose a scalable omnidirectional video coding method to improve the coding efficiency with the help of the viewer's point of view (POV) and provide three-layer scalability as well. Based on the equirectangular projection (ERP), a down-sampling procedure for the ERP video with a corresponding super-resolution method is proposed to save bandwidth and provide spatial resolution scalability. With the super-resolution version of the reconstructed down-sampled video as the inter-view reference, the viewer's POV within the sphere is mapped and encoded in high quality, while the non-POV areas are compressed in low quality to further improve the coding efficiency and provide quality scalability. The correlation between the ERP and the cube map projection is utilized in the POV mapping procedure. The proposed scalable coding method is built on the multiview extension of High Efficiency Video Coding, where only a few modifications are made on the encoder side. Experimental results demonstrate that the proposed method can save approximately 75% average bit rate with no significant decrease in quality of the viewer's POV region compared with the HEVC standard.
The Super-Resolution Generative Adversarial Network (SRGAN) is a seminal work that is capable of generating realistic textures during single image super-resolution. However, the hallucinated details are often accompanied with unpleasant artifacts. To further enhance the visual quality, we thoroughly study three key components of SRGAN – network architecture, adversarial loss and perceptual loss, and improve each of them to derive an Enhanced SRGAN (ESRGAN). In particular, we introduce the Residual-in-Residual Dense Block (RRDB) without batch normalization as the basic network building unit. Moreover, we borrow the idea from relativistic GAN to let the discriminator predict relative realness instead of the absolute value. Finally, we improve the perceptual loss by using the features before activation, which could provide stronger supervision for brightness consistency and texture recovery. Benefiting from these improvements, the proposed ESRGAN achieves consistently better visual quality with more realistic and natural textures than SRGAN and won the first place in the PIRM2018-SR Challenge (region 3) with the best perceptual index. The code is available at https://github.com/xinntao/ESRGAN.
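The relativistic discriminator mentioned above can be summarized compactly: the discriminator is trained to predict that a real image is more realistic than the average generated one, and vice versa. The sketch below follows the standard relativistic-average GAN losses on raw discriminator logits; batch shapes and names are illustrative:

```python
import torch
import torch.nn.functional as F

def relativistic_d_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Discriminator loss: real images should score higher than the average fake."""
    real_rel = d_real - d_fake.mean()
    fake_rel = d_fake - d_real.mean()
    return 0.5 * (F.binary_cross_entropy_with_logits(real_rel, torch.ones_like(real_rel))
                  + F.binary_cross_entropy_with_logits(fake_rel, torch.zeros_like(fake_rel)))

def relativistic_g_loss(d_real: torch.Tensor, d_fake: torch.Tensor) -> torch.Tensor:
    """Generator loss: the symmetric objective with the labels swapped."""
    real_rel = d_real - d_fake.mean()
    fake_rel = d_fake - d_real.mean()
    return 0.5 * (F.binary_cross_entropy_with_logits(real_rel, torch.zeros_like(real_rel))
                  + F.binary_cross_entropy_with_logits(fake_rel, torch.ones_like(fake_rel)))

d_real, d_fake = torch.randn(8, 1), torch.randn(8, 1)  # toy discriminator logits
print(relativistic_d_loss(d_real, d_fake).item(), relativistic_g_loss(d_real, d_fake).item())
```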
Recent research on super-resolution has progressed with the development of deep convolutional neural networks (DCNN). In particular, residual learning techniques exhibit improved performance. In this paper, we develop an enhanced deep super-resolution network (EDSR) with performance exceeding those of current state-of-the-art SR methods. The significant performance improvement of our model is due to optimization by removing unnecessary modules in conventional residual networks. The performance is further improved by expanding the model size while we stabilize the training procedure. We also propose a new multi-scale deep super-resolution system (MDSR) and training method, which can reconstruct high-resolution images of different upscaling factors in a single model. The proposed methods show superior performance over the state-of-the-art methods on benchmark datasets and prove its excellence by winning the NTIRE2017 Super-Resolution Challenge[26].
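A minimal sketch of the kind of batch-normalization-free residual block this line of work builds on is given below; the channel count and residual scaling factor are illustrative choices, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions with ReLU, no batch normalization, scaled skip connection."""
    def __init__(self, channels: int = 64, res_scale: float = 0.1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
        self.res_scale = res_scale

    def forward(self, x):
        return x + self.res_scale * self.body(x)

print(ResBlock()(torch.rand(1, 64, 48, 48)).shape)  # torch.Size([1, 64, 48, 48])
```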
Despite the breakthroughs in accuracy and speed of single image super-resolution using faster and deeper convolutional neural networks, one central problem remains largely unsolved: how do we recover the finer texture details when we super-resolve at large upscaling factors? The behavior of optimization-based super-resolution methods is principally driven by the choice of the objective function. Recent work has largely focused on minimizing the mean squared reconstruction error. The resulting estimates have high peak signal-to-noise ratios, but they are often lacking high-frequency details and are perceptually unsatisfying in the sense that they fail to match the fidelity expected at the higher resolution. In this paper, we present SRGAN, a generative adversarial network (GAN) for image super-resolution (SR). To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4x upscaling factors. To achieve this, we propose a perceptual loss function which consists of an adversarial loss and a content loss. The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images. In addition, we use a content loss motivated by perceptual similarity instead of similarity in pixel space. Our deep residual network is able to recover photo-realistic textures from heavily downsampled images on public benchmarks. An extensive mean-opinion-score (MOS) test shows hugely significant gains in perceptual quality using SRGAN. The MOS scores obtained with SRGAN are closer to those of the original high-resolution images than to those obtained with any state-of-the-art method.
We consider image transformation problems, where an input image is transformed into an output image. Recent methods for such problems typically train feed-forward convolutional neural networks using a per-pixel loss between the output and ground-truth images. Parallel work has shown that high-quality images can be generated by defining and optimizing perceptual loss functions based on high-level features extracted from pretrained networks. We combine the benefits of both approaches, and propose the use of perceptual loss functions for training feed-forward networks for image transformation tasks. We show results on image style transfer, where a feed-forward network is trained to solve the optimization problem proposed by Gatys et al. in real-time. Compared to the optimization-based method, our network gives similar qualitative results but is three orders of magnitude faster. We also experiment with single-image super-resolution, where replacing a per-pixel loss with a perceptual loss gives visually pleasing results.
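A minimal sketch of a perceptual (feature) loss in this spirit compares the two images in the feature space of a frozen, ImageNet-pretrained VGG network rather than in pixel space; the specific layer cutoff is an assumption made for illustration:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

class PerceptualLoss(torch.nn.Module):
    def __init__(self, layer: int = 35):  # truncate before the last ReLU (illustrative choice)
        super().__init__()
        features = vgg19(weights="IMAGENET1K_V1").features[:layer]
        for p in features.parameters():
            p.requires_grad_(False)       # the feature extractor stays frozen
        self.features = features.eval()

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.features(pred), self.features(target))

loss = PerceptualLoss()(torch.rand(1, 3, 96, 96), torch.rand(1, 3, 96, 96))
```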
We present SR3, an approach to image Super-Resolution via Repeated Refinement. SR3 adapts denoising diffusion probabilistic models (Ho et al. 2020), (Sohl-Dickstein et al. 2015) to image-to-image translation, and performs super-resolution through a stochastic iterative denoising process. Output images are initialized with pure Gaussian noise and iteratively refined using a U-Net architecture that is trained on denoising at various noise levels, conditioned on a low-resolution input image. SR3 exhibits strong performance on super-resolution tasks at different magnification factors, on faces and natural images. We conduct human evaluation on a standard 8× face super-resolution task on CelebA-HQ for which SR3 achieves a fool rate close to 50%, suggesting photo-realistic outputs, while GAN baselines do not exceed a fool rate of 34%. We evaluate SR3 on a 4× super-resolution task on ImageNet, where SR3 outperforms baselines in human evaluation and classification accuracy of a ResNet-50 classifier trained on high-resolution images. We further show the effectiveness of SR3 in cascaded image generation, where a generative model is chained with super-resolution models to synthesize high-resolution images with competitive FID scores on the class-conditional 256×256 ImageNet generation challenge.
Recently, several models based on deep neural networks have achieved great success in terms of both reconstruction accuracy and computational performance for single image super-resolution. In these methods, the low resolution (LR) input image is upscaled to the high resolution (HR) space using a single filter, commonly bicubic interpolation, before reconstruction. This means that the super-resolution (SR) operation is performed in HR space. We demonstrate that this is sub-optimal and adds computational complexity. In this paper, we present the first convolutional neural network (CNN) capable of real-time SR of 1080p videos on a single K2 GPU. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space. In addition, we introduce an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. By doing so, we effectively replace the handcrafted bicubic filter in the SR pipeline with more complex upscaling filters specifically trained for each feature map, whilst also reducing the computational complexity of the overall SR operation. We evaluate the proposed approach using images and videos from publicly available datasets and show that it performs significantly better (+0.15dB on Images and +0.39dB on Videos) and is an order of magnitude faster than previous CNN-based methods.
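The sub-pixel convolution idea can be sketched in a few lines: a convolution produces r²·C feature maps in low-resolution space and a pixel-shuffle rearranges them into the high-resolution output, so the heavy computation stays in LR space. Sizes below are illustrative:

```python
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    """Convolution in LR space followed by PixelShuffle to reach the HR resolution."""
    def __init__(self, in_channels: int = 64, out_channels: int = 3, scale: int = 3):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels * scale ** 2, 3, padding=1)
        self.shuffle = nn.PixelShuffle(scale)

    def forward(self, lr_features):
        return self.shuffle(self.conv(lr_features))

hr = SubPixelUpsampler()(torch.rand(1, 64, 32, 32))  # LR-space feature maps in, HR image out
print(hr.shape)                                      # torch.Size([1, 3, 96, 96])
```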
The pre-trained text-to-image diffusion models have been increasingly employed to tackle the real-world image super-resolution (Real-ISR) problem due to their powerful generative image priors. Most of the existing methods start from random noise to reconstruct the high-quality (HQ) image under the guidance of the given low-quality (LQ) image. While promising results have been achieved, such Real-ISR methods require multiple diffusion steps to reproduce the HQ image, increasing the computational cost. Meanwhile, the random noise introduces uncertainty in the output, which is unfriendly to image restoration tasks. To address these issues, we propose a one-step effective diffusion network, namely OSEDiff, for the Real-ISR problem. We argue that the LQ image contains rich information to restore its HQ counterpart, and hence the given LQ image can be directly taken as the starting point for diffusion, eliminating the uncertainty introduced by random noise sampling. We finetune the pre-trained diffusion network with trainable layers to adapt it to complex image degradations. To ensure that the one-step diffusion model could yield HQ Real-ISR output, we apply variational score distillation in the latent space to conduct KL-divergence regularization. As a result, our OSEDiff model can efficiently and effectively generate HQ images in just one diffusion step. Our experiments demonstrate that OSEDiff achieves comparable or even better Real-ISR results, in terms of both objective metrics and subjective evaluations, than previous diffusion model-based Real-ISR methods that require dozens or hundreds of steps. The source codes are released at https://github.com/cswry/OSEDiff.
A very deep convolutional neural network (CNN) has recently achieved great success for image super-resolution (SR) and offered hierarchical features as well. However, most deep CNN based SR models do not make full use of the hierarchical features from the original low-resolution (LR) images, thereby achieving relatively-low performance. In this paper, we propose a novel residual dense network (RDN) to address this problem in image SR. We fully exploit the hierarchical features from all the convolutional layers. Specifically, we propose residual dense block (RDB) to extract abundant local features via dense connected convolutional layers. RDB further allows direct connections from the state of preceding RDB to all the layers of current RDB, leading to a contiguous memory (CM) mechanism. Local feature fusion in RDB is then used to adaptively learn more effective features from preceding and current local features and stabilizes the training of wider network. After fully obtaining dense local features, we use global feature fusion to jointly and adaptively learn global hierarchical features in a holistic way. Experiments on benchmark datasets with different degradation models show that our RDN achieves favorable performance against state-of-the-art methods.
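A compact sketch of a residual dense block is shown below: densely connected 3x3 convolutions, 1x1 local feature fusion back to the base channel width, and a local residual connection. The growth rate and number of layers are illustrative:

```python
import torch
import torch.nn as nn

class RDB(nn.Module):
    def __init__(self, channels: int = 64, growth: int = 32, layers: int = 4):
        super().__init__()
        self.convs = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels + i * growth, growth, 3, padding=1),
                          nn.ReLU(inplace=True))
            for i in range(layers)
        ])
        self.fuse = nn.Conv2d(channels + layers * growth, channels, 1)  # local feature fusion

    def forward(self, x):
        feats = [x]
        for conv in self.convs:
            feats.append(conv(torch.cat(feats, dim=1)))  # dense connections to all earlier features
        return x + self.fuse(torch.cat(feats, dim=1))    # local residual learning

print(RDB()(torch.rand(1, 64, 32, 32)).shape)
```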
No abstract available
Transformer-based methods have demonstrated promising performance in image super-resolution tasks due to their long-range and global aggregation capability. However, the existing Transformer brings two critical challenges for applying it in large-area earth observation scenes: (1) redundant token representation due to many irrelevant tokens; (2) single-scale representation, which ignores scale correlation modeling of similar ground observation targets. To this end, this paper proposes to adaptively eliminate the interference of irrelevant tokens for a more compact self-attention calculation. Specifically, we devise a Residual Token Selective Group (RTSG) to grasp the most crucial tokens by dynamically selecting the top-$k$ keys in terms of score ranking for each query. For better feature aggregation, a Multi-scale Feed-forward Layer (MFL) is developed to generate an enriched representation of multi-scale feature mixtures during the feed-forward process. Moreover, we also propose a Global Context Attention (GCA) to fully explore the most informative components, thus introducing more inductive bias to the RTSG for an accurate reconstruction. In particular, multiple cascaded RTSGs form our final Top-$k$ Token Selective Transformer (TTST) to achieve progressive representation. Extensive experiments on simulated and real-world remote sensing datasets demonstrate that our TTST performs favorably against state-of-the-art CNN-based and Transformer-based methods, both qualitatively and quantitatively. In brief, TTST outperforms the state-of-the-art approach (HAT-L) in terms of PSNR by 0.14 dB on average, while accounting for only 47.26% and 46.97% of its computational cost and parameters. The code and pre-trained TTST will be available at https://github.com/XY-boy/TTST for validation.
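The token-selection step can be illustrated with a hedged sketch of top-k attention: for each query, only the k highest-scoring keys take part in the softmax and all other positions are masked out. This shows the selection idea only, not the exact RTSG/MFL/GCA design:

```python
import torch
import torch.nn.functional as F

def topk_attention(q, k, v, top_k: int = 16):
    """q, k, v: (B, N, D) token tensors; keep only the top-k keys per query."""
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale           # (B, N, N) similarity scores
    kth = scores.topk(top_k, dim=-1).values[..., -1:]    # k-th largest score per query
    scores = scores.masked_fill(scores < kth, float("-inf"))
    return F.softmax(scores, dim=-1) @ v                 # attend over the selected tokens only

q = k = v = torch.rand(1, 64, 32)
print(topk_attention(q, k, v).shape)                     # torch.Size([1, 64, 32])
```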
We propose a deep learning method for single image super-resolution (SR). Our method directly learns an end-to-end mapping between the low/high-resolution images. The mapping is represented as a deep convolutional neural network (CNN) that takes the low-resolution image as the input and outputs the high-resolution one. We further show that traditional sparse-coding-based SR methods can also be viewed as a deep convolutional network. But unlike traditional methods that handle each component separately, our method jointly optimizes all layers. Our deep CNN has a lightweight structure, yet demonstrates state-of-the-art restoration quality, and achieves fast speed for practical on-line usage. We explore different network structures and parameter settings to achieve trade-offs between performance and speed. Moreover, we extend our network to cope with three color channels simultaneously, and show better overall reconstruction quality.
We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution. Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during the inference process. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to resolutions of any size. A comprehensive evaluation of our method using both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR.
Diffusion-based image super-resolution (SR) methods are mainly limited by the low inference speed due to the requirements of hundreds or even thousands of sampling steps. Existing acceleration sampling techniques inevitably sacrifice performance to some extent, leading to over-blurry SR results. To address this issue, we propose a novel and efficient diffusion model for SR that significantly reduces the number of diffusion steps, thereby eliminating the need for post-acceleration during inference and its associated performance deterioration. Our method constructs a Markov chain that transfers between the high-resolution image and the low-resolution image by shifting the residual between them, substantially improving the transition efficiency. Additionally, an elaborate noise schedule is developed to flexibly control the shifting speed and the noise strength during the diffusion process. Extensive experiments demonstrate that the proposed method obtains superior or at least comparable performance to current state-of-the-art methods on both synthetic and real-world datasets, even only with 15 sampling steps. Our code and model are available at https://github.com/zsyOAOA/ResShift.
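As a rough illustration of the residual-shifting idea (our paraphrase; the notation is assumed rather than quoted from the paper), the forward transition can be viewed as a Gaussian whose mean moves from the high-resolution image $\mathbf{x}_0$ toward the low-resolution image $\mathbf{y}$ as a monotone shift schedule $\eta_t$ grows from 0 toward 1, with $\kappa$ controlling the noise strength:

$$ q(\mathbf{x}_t \mid \mathbf{x}_0, \mathbf{y}) = \mathcal{N}\!\left(\mathbf{x}_t;\; \mathbf{x}_0 + \eta_t\,(\mathbf{y} - \mathbf{x}_0),\; \kappa^2 \eta_t \mathbf{I}\right) $$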
No abstract available
The fusion of LiDAR and camera sensors offers remarkable results in multimodal 3D object detection with enhanced performance. However, existing fusion methods are primarily designed considering ideal data, ignoring the practical challenges of sensor specification and environmental variations encountered in autonomous driving. Thus, these methods often exhibit a significant performance degradation when faced with adverse conditions, such as sparse point cloud and inclement weather. To address these multiple adverse conditions simultaneously, we present the first attempt to apply auxiliary restoration networks in multimodal 3D object detection. These networks restore degraded point cloud and image, ensuring the primary multimodal detection network obtains higher quality features in a unified form. Especially, we propose a spherical domain point upsampler based on bilateral point generation and an adjustment network with a horizontal alignment block. Additionally, for efficient fusion with restored point cloud and image, we suggest a graph detector with a unified loss function, including auxiliary, contrastive, and difficulty losses. The experimental results demonstrate that the proposed approach prevents a performance decline in adverse conditions and outperforms state-of-the-art methods. The source code with pretrained weights for the proposed model is available at https://github.com/jhyoon964/auxphere.
As an emerging media format, virtual reality (VR) has attracted the attention of researchers. 6-DoF VR can reconstruct the surrounding environment with the help of the depth information of the scene, so as to provide users with an immersive experience. However, due to the lack of depth information in panoramic images, it remains a challenge to convert a panorama to 6-DoF VR. In this paper, we propose a new depth estimation method, SPCNet, based on spherical convolution to solve the problem of depth information restoration for panoramic images. In particular, spherical convolution is introduced to improve depth estimation accuracy by reducing the distortion attributed to Equi-Rectangular Projection (ERP). The experimental results show that many indicators of SPCNet are better than those of other advanced networks. For example, RMSE is 0.419 lower than that of UResNet. Moreover, the threshold accuracy of depth estimation has also been improved.
Fisheye lenses have the prominent advantage of a large field of view, which is suitable for overall environmental monitoring during the restoration of polluted sites. At the same time, however, fisheye lenses introduce large distortion, which strongly affects image matching and stitching between different cameras. Traditional wide-angle lens correction methods correct only along the longitude direction and therefore perform poorly. To address this, a fisheye correction algorithm based on a spherical perspective projection constraint is proposed: under spherical perspective projection, a space line is no longer projected as a straight line in the image plane but as a great circle on the sphere. First, a scan-line approximation method is adopted to extract the effective area of the fisheye image and determine its center and radius. Then, to overcome the shortcomings of the spherical coordinate positioning method and the latitude-longitude mapping method, a correction method based on the spherical perspective projection constraint is formulated. Finally, experimental verification is carried out. The experiments show that this method corrects fisheye images well: the shapes of objects in the image are restored and several arcs become parallel lines, consistent with reality.
Reference-based super-resolution (RefSR) enhances the detail restoration capability of low-resolution (LR) images by utilizing the details and texture information of external reference (Ref) images. This study proposes a RefSR method based on hash adaptive matching and progressive multi-scale dynamic aggregation to improve super-resolution reconstruction. Firstly, to address the issue of feature matching, this study proposes a hash adaptive matching module. On the basis of the traditional similarity calculation between LR and Ref images, self-similarity information of the LR image is added to assist super-resolution reconstruction. By dividing the feature space into multiple hash buckets through spherical hashing, the matching range is narrowed from a global search to local neighborhoods, enabling efficient matching in more informative regions. This not only retains global modeling capabilities but also significantly reduces computational costs. In addition, a learnable similarity scoring function is designed to adaptively optimize the similarity score between LR and Ref images, improving matching accuracy. Secondly, for feature transfer, this study proposes a progressive multi-scale dynamic aggregation module. This module utilizes dynamic decoupling filters to simultaneously perceive texture information in both spatial and channel domains, extracting key information more accurately and effectively suppressing irrelevant texture interference. In addition, this module enhances the robustness of the model to large-scale biases by gradually adjusting features at different scales, ensuring the accuracy of texture transfer. The experimental results show that this method achieves superior super-resolution reconstruction performance on multiple benchmark datasets.
A camera module employing spherical single-element lens imaging system (SSLIS) is introduced in this study. This type of imaging system can be used in compact digital cameras or mobile phone cameras, and it provides the advantages of simple design, reduced device bulkiness, and reduced manufacturing costs. When compared with conventional camera modules, our system produces radially variant blurred images, which can be satisfactorily restored by means of a polar domain deconvolution algorithm proposed in our previous study. In this study, we demonstrate an improved version of this algorithm that enables full-field-of-view (FOV) image restoration instead of the partial FOV restoration obtained via our previous algorithm. This improvement is realized by interpolating the upper and arc-shaped boundaries of the panoramic polar image such that the ringing artifacts around the center and four boundaries of the restored Cartesian image are greatly suppressed. The effectiveness of the improved algorithm is verified by image restoration of both computer simulated images and real-world scenes captured by the spherical single lens camera module. The quality of the restored image depends on the overall sparsity of all the point spread function (PSF) block Toeplitz with circulant blocks (BTCB) matrices used to restore a radially blurred image.
No abstract available
Robot soccer, as an advanced field integrating artificial intelligence, image processing, and robot control, has captured the attention of numerous scholars. This study focuses on a rapid recognition method based on omnidirectional vision imagery, aiming to enhance the efficiency of soccer robots in target identification and localization. Initially, the paper explores the selection of omnidirectional vision types and their imaging influencing factors, particularly highlighting the application of catadioptric omnidirectional vision. Subsequently, it analyzes the imaging requirements for spherical targets and proposes a radial length projection model based on convex reflective surfaces. To address issues related to solving reflective surface and viewpoint change errors, this paper establishes corresponding projection functions. Furthermore, it designs a contour-based rapid region labeling algorithm and integrates convex hull restoration methods to improve the accuracy and speed of target recognition. Comparative experiments validate the effectiveness and superiority of the proposed methods. This research not only provides novel insights into visual image processing for robot soccer but also establishes a foundation for further exploration in related fields.
Segmentation, a useful and powerful technique in pattern recognition, is the process of identifying object outlines within images. There are a number of efficient algorithms for segmentation in Euclidean space that depend on the variational approach and partial differential equation modelling. Wavelets have been used successfully in various problems in image processing, including segmentation, inpainting, noise removal, super-resolution image restoration, and many others. Wavelets on the sphere have been developed to solve such problems for data defined on the sphere, which arise in numerous fields such as cosmology and geophysics. In this work, we propose a wavelet-based method to segment images on the sphere, accounting for the underlying geometry of spherical data. Our method is a direct extension of the tight-frame based segmentation method used to automatically identify tube-like structures such as blood vessels in medical imaging. It is compatible with any arbitrary type of wavelet frame defined on the sphere, such as axisymmetric wavelets, directional wavelets, curvelets, and hybrid wavelet constructions. Such an approach allows the desirable properties of wavelets to be naturally inherited in the segmentation process. In particular, directional wavelets and curvelets, which were designed to efficiently capture directional signal content, provide additional advantages in segmenting images containing prominent directional and curvilinear features. We present several numerical experiments, applying our wavelet-based segmentation method, as well as the common K-means method, on real-world spherical images, including an Earth topographic map, a light probe image, solar data-sets, and spherical retina images. These experiments demonstrate the superiority of our method and show that it is capable of segmenting different kinds of spherical images, including those with prominent directional features. Moreover, our algorithm is efficient, with convergence usually within a few iterations.
Image restoration is a long-standing low-level vision problem that aims to restore high-quality images from low-quality images (e.g., downscaled, noisy and compressed images). While state-of-the-art image restoration methods are based on convolutional neural networks, few attempts have been made with Transformers which show impressive performance on high-level vision tasks. In this paper, we propose a strong baseline model SwinIR for image restoration based on the Swin Transformer. SwinIR consists of three parts: shallow feature extraction, deep feature extraction and high-quality image reconstruction. In particular, the deep feature extraction module is composed of several residual Swin Transformer blocks (RSTB), each of which has several Swin Transformer layers together with a residual connection. We conduct experiments on three representative tasks: image super-resolution (including classical, lightweight and real-world image super-resolution), image denoising (including grayscale and color image denoising) and JPEG compression artifact reduction. Experimental results demonstrate that SwinIR outperforms state-of-the-art methods on different tasks by up to 0.14∼0.45dB, while the total number of parameters can be reduced by up to 67%.
Super-Resolution (SR) has gained increasing research attention over the past few years. With the development of Deep Neural Networks (DNNs), many super-resolution methods based on DNNs have been proposed. Although most of these methods are aimed at ordinary frames, there are few works on super-resolution of omnidirectional frames. In these works, omnidirectional frames are projected from the 3D sphere to a 2D plane by Equi-Rectangular Projection (ERP). Although ERP has been widely used for projection, it suffers from severe projection distortion near the poles. Current DNN-based SR methods use 2D convolution modules, which are better suited to regular grids. In this paper, we find that different projection methods have a great impact on the performance of DNNs. To study this problem, a comprehensive comparison of projections in omnidirectional super-resolution is conducted. We compare the SR results of different projection methods. Experimental results show that Equi-Angular cube map projection (EAC), which has minimal distortion, achieves the best result in terms of WS-PSNR compared with other projections. Code and data will be released.
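WS-PSNR, the metric used for the comparison above, down-weights the heavily oversampled polar rows of an ERP image by the cosine of their latitude. A NumPy sketch of this commonly used definition follows; implementation details such as the peak value or rounding may differ from any particular evaluation code.

```python
import numpy as np

def ws_psnr(ref, dist, peak=255.0):
    """Weighted-to-spherically-uniform PSNR for ERP images.
    ref, dist: arrays of shape (H, W) or (H, W, C) with values in [0, peak]."""
    h = ref.shape[0]
    lat = (np.arange(h) + 0.5 - h / 2.0) * np.pi / h   # latitude of each row
    w = np.cos(lat)[:, None]                            # per-row weight
    w = np.broadcast_to(w, ref.shape[:2])
    if ref.ndim == 3:
        w = w[..., None]
    err = (ref.astype(np.float64) - dist.astype(np.float64)) ** 2
    wmse = np.sum(w * err) / np.sum(np.broadcast_to(w, ref.shape))
    return 10.0 * np.log10(peak ** 2 / wmse)
```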
Omnidirectional image super-resolution (ODISR) is critical for VR/AR applications, as high-quality 360° visual content significantly enhances immersive experiences. However, existing ODISR methods suffer from limited receptive fields and high computational complexity, which restricts their ability to model long-range dependencies and extract global structural features. Consequently, these limitations hinder the effective reconstruction of high-frequency details. To address these issues, we propose a novel Mamba-based ODISR network, termed MambaOSR, which consists of three key modules working collaboratively for accurate reconstruction. Specifically, we first introduce a spatial-frequency visual state space model (SF-VSSM) to capture global contextual information via dual-domain representation learning, thereby enhancing the preservation of high-frequency details. Subsequently, we design a distortion-guided module (DGM) that leverages distortion map priors to adaptively model geometric distortions, effectively suppressing artifacts resulting from equirectangular projections. Finally, we develop a multi-scale feature fusion module (MFFM) that integrates complementary features across multiple scales, further improving reconstruction quality. Extensive experiments conducted on the SUN360 dataset demonstrate that our proposed MambaOSR achieves a 0.16 dB improvement in WS-PSNR and increases the mutual information by 1.99% compared with state-of-the-art methods, significantly enhancing both visual quality and the information richness of omnidirectional images.
No abstract available
Omnidirectional images (ODIs) are vital in VR and 360° imaging but suffer from quality degradation during acquisition and transmission. Existing blind SR methods often ignore multi-degradation coupling and spherical geometry. We propose a blind SR algorithm combining a stochastic degradation model (blur, noise, JPEG compression) and an attention-enhanced dual-branch network with SE-guided residual dense blocks and a relativistic adversarial discriminator. The attention mechanism adaptively recalibrates spherical features, while dense connections enable hierarchical fusion across equatorial and polar regions. Evaluations on ODI-SR show significant PSNR/SSIM gains over baselines, with ablation studies validating the synergy of degradation modeling and architectural innovations in preserving high-frequency details and geometric consistency.
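The SE-guided channel recalibration mentioned above can be pictured with a standard Squeeze-and-Excitation block; the sketch below is a generic version and does not reproduce the paper's dual-branch design or its dense connections.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Generic Squeeze-and-Excitation block: global average pooling, a small
    bottleneck MLP, and a sigmoid gate that re-weights each channel."""
    def __init__(self, channels=64, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        weights = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * weights

x = torch.randn(1, 64, 32, 64)
y = SEBlock()(x)                       # same shape, channel-recalibrated
```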
No abstract available
No abstract available
Omnidirectional or 360-degree video is being increasingly deployed, largely due to the latest advancements in immersive virtual reality (VR) and extended reality (XR) technology. However, the adoption of these videos in streaming encounters challenges related to bandwidth and latency, particularly in mobility conditions such as with unmanned aerial vehicles (UAVs). Adaptive resolution and compression aim to preserve quality while maintaining low latency under these constraints, yet downscaling and encoding can still degrade quality and introduce artifacts. Machine learning (ML)-based super-resolution (SR) and quality enhancement techniques offer a promising solution by enhancing detail recovery and reducing compression artifacts. However, current publicly available 360-degree video SR datasets lack compression artifacts, which limit research in this field. To bridge this gap, this paper introduces omnidirectional video streaming dataset (ODVista), which comprises 200 high-resolution and high quality videos downscaled and encoded at four bitrate ranges using the high-efficiency video coding (HEVC)/H.265 standard. Evaluations show that the dataset not only features a wide variety of scenes but also spans different levels of content complexity, which is crucial for robust solutions that perform well in real-world scenarios and generalize across diverse visual environments. Additionally, we evaluate the performance, considering both quality enhancement and runtime, of two handcrafted and two ML-based SR models on the validation and testing sets of ODVista.
No abstract available
To address the significant distortion problem caused by the special projection used in equi-rectangular projection (ERP) images, this paper proposes an omnidirectional image super-resolution algorithm model based on position information transformation, taking SwinIR as the base. By introducing a space position transformation module that supports deformable convolution, the image preprocessing process is optimized to reduce the distortion effects in the polar regions of the ERP image. Meanwhile, by introducing deformable convolution in the deep feature extraction process, the model's adaptability to local deformations of images is enhanced. Experimental results on publicly available datasets have shown that our method outperforms SwinIR, with an average improvement of over 0.2 dB in WS-PSNR and over 0.030 in WS-SSIM for ×4 pixel upscaling.
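The deformable convolution referred to above lets the network learn where to sample the feature map instead of using a rigid grid, which helps it follow the latitude-dependent stretching of ERP content. A minimal PyTorch sketch using torchvision's DeformConv2d is shown below; it is illustrative only and does not reproduce the paper's position transformation module.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformFeatureBlock(nn.Module):
    """A small conv predicts 2D sampling offsets for every kernel tap, and the
    deformable convolution samples the feature map at those shifted positions
    instead of a rigid 3x3 grid."""
    def __init__(self, channels=64, k=3):
        super().__init__()
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.dcn = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x):
        return self.dcn(x, self.offset(x))

x = torch.randn(1, 64, 64, 128)
y = DeformFeatureBlock()(x)            # same spatial size as the input
```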
Omnidirectional videos (ODVs) provide an immersive visual experience by capturing the 360° scene. With the rapid advancements in virtual/augmented reality, metaverse, and generative artificial intelligence, the demand for high-quality ODVs is surging. However, ODVs often suffer from low resolution due to their wide field of view and limitations in capturing devices and transmission bandwidth. Although video super-resolution (SR) is a capable video quality enhancement technique, the performance ceiling and practical generalization of existing methods are limited when applied to ODVs due to their unique attributes. To alleviate spatial projection distortions and temporal flickering of ODVs, we propose a Spatio-Temporal Distortion Aware Network (STDAN) with joint spatio-temporal alignment and reconstruction. Specifically, we incorporate a spatio-temporal continuous alignment (STCA) to mitigate discrete geometric artifacts in parallel with temporal alignment. Subsequently, we introduce an interlaced multi-frame reconstruction (IMFR) to enhance temporal consistency. Furthermore, we employ latitude-saliency adaptive (LSA) weights to focus on regions with higher texture complexity and human-watching interest. By exploring a spatio-temporal jointly framework and real-world viewing strategies, STDAN effectively reinforces spatio-temporal coherence on a novel ODV-SR dataset and ensures affordable computational costs. Extensive experimental results demonstrate that STDAN outperforms state-of-the-art methods in improving visual fidelity and dynamic smoothness of ODVs.
With the increasing popularity of virtual techniques, such as virtual reality (VR) and augmented reality (AR), super-resolution (SR) of omnidirectional images has been crucial for more immersive and realistic experiences. This advancement also enhances the quality of images for various visual applications. Researchers have started exploring omnidirectional image super-resolution (ODISR). However, existing methods primarily address the problem using synthetic data pairs, where low-resolution (LR) images are generated using fixed, predefined kernels, such as bicubic downsampling. Consequently, the performance of these methods drops significantly when applied to real-world data. To address this issue, in this paper, we propose exploring the rich image priors from existing SR models designed for 2D planar images and adapting them for real-world ODISR. Specifically, we employ low-rank adaptation (LoRA) to adapt a large-scale model from the 2D planar image domain to the omnidirectional image domain by training only the decomposed matrices. This approach significantly reduces the number of parameters and computational resources required. Experimental results demonstrate that the proposed method outperforms other state-of-the-art methods both quantitatively and qualitatively.
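Low-rank adaptation as described above freezes the pretrained planar-SR weights and trains only a small low-rank update. A minimal sketch for a single linear layer follows; the rank and scaling values are illustrative, not those used in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a pretrained linear layer: the base weight is frozen and only the
    low-rank update (B @ A, rank r) is trained."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(256, 256))
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))  # only A and B train
```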
This report introduces two high-quality datasets, Flickr360 and ODV360, for omnidirectional image and video super-resolution, respectively, and reports the NTIRE 2023 challenge on 360° omnidirectional image and video super-resolution. Unlike ordinary 2D images/videos with a narrow field of view, omnidirectional images/videos can represent the whole scene from all directions in one shot. There exists a large gap between omnidirectional image/video and ordinary 2D image/video in both the degradation and restoration processes. The challenge is held to facilitate the development of omnidirectional image/video super-resolution by considering their special characteristics. In this challenge, two tracks are provided: one is the omnidirectional image super-resolution and the other is the omnidirectional video super-resolution. The task of the challenge is to super-resolve an input omnidirectional image/video with a magnification factor of ×4. Realistic omnidirectional downsampling is applied to construct the datasets. Some general degradation (e.g., video compression) is also considered for the video track. The challenge has 100 and 56 registered participants for those two tracks. In the final testing stage, 7 and 3 participating teams submitted their results, source codes, and fact sheets. Almost all teams achieved better performance than baseline models by integrating omnidirectional characteristics, reaching compelling performance on our newly collected Flickr360 and ODV360 datasets.
Omnidirectional images (ODIs) have recently attracted extensive attention from both academia and industry. However, due to storage and transmission limitations, ODIs are usually at extremely low resolution. Thus, it is necessary to restore a high-resolution ODI from a low-resolution ODI, i.e., omnidirectional image super-resolution (ODI-SR). Towards ODI-SR, we propose in this paper a novel latitude-aware upscaling network, namely LAU-Net+, which fully considers the above characteristics of ODIs. In our network, different latitude bands can learn to adopt distinct upscaling factors, which significantly saves the computational resources and improves the SR efficiency. Specifically, a Laplacian multilevel pyramid network is introduced in which the upscaling factor is gradually increased with the number of levels. Each level is composed of a feature enhancement module (FEM), a drop-band decision module (DDM) and a high-latitude enhancement module (HEM). The FEM module serves to enhance the high-level features extracted from the input ODI, while the role of DDM is to dynamically drop the unnecessary high latitude bands and send the remained bands to the next level. The HEM is adopted to further enhance high-level features of dropped latitude bands with a lightweight architecture. In DDM, we develop a reinforcement learning scheme with a latitude adaptive reward to determine which band should be dropped. To the best of our knowledge, our method is the first work which considers the latitude characteristics for ODI-SR task. Extensive experimental results demonstrate that our LAU-Net+ achieves state-of-the-art results on ODI-SR both quantitatively and qualitatively on various ODI datasets.
360° omnidirectional images have gained research attention due to their immersive and interactive experience, particularly in AR/VR applications. However, they suffer from lower angular resolution due to being captured by fisheye lenses with the same sensor size for capturing planar images. To solve the above issues, we propose a two-stage framework for 360° omnidirectional image super-resolution. The first stage employs two branches: model A, which incorporates omnidirectional position-aware deformable blocks (OPDB) and Fourier upsampling, and model B, which adds a spatial frequency fusion module (SFF) to model A. Model A aims to enhance the feature extraction ability of 360° image positional information, while Model B further focuses on the high-frequency information of 360° images. The second stage performs same-resolution enhancement based on the structure of model A with a pixel unshuffle operation. In addition, we collected data from YouTube to improve the fitting ability of the transformer, and created pseudo low-resolution images using a degradation network. Our proposed method achieves superior performance and wins the NTIRE 2023 challenge of 360° omnidirectional image super-resolution.
No abstract available
Omnidirectional image (ODI) super-resolution (SR) is an important technique in augmented reality and virtual reality applications to address the low-resolution problem caused by limitations in capturing devices or bandwidth. The ODI projection distortion makes it challenging to apply existing SR methods. In this paper, we propose an ODI SR method by leveraging the characteristics of ODIs and human visual characteristics. Specifically, we firstly design a perception-orientated adaptive loss function by jointly utilizing saliency map and latitude map. In our proposed ODI-SR network, we introduce an attention module to aggregate multi-scale information and leverage spherical convolution to adapt to the spheric format of ODIs. Furthermore, we design a data augmentation strategy for ODIs according to viewpoint distribution to further improve the visual quality of SR images. Extensive experimental results demonstrate that the proposed method achieves state-of-the-art performance according to both qualitative and quantitative evaluations.
The omnidirectional images (ODIs) are usually at low-resolution, due to the constraints of collection, storage and transmission. The traditional two-dimensional (2D) image super-resolution methods are not effective for spherical ODIs, because ODIs tend to have non-uniformly distributed pixel density and varying texture complexity across latitudes. In this work, we propose a novel latitude adaptive upscaling network (LAU-Net) for ODI super-resolution, which allows pixels at different latitudes to adopt distinct upscaling factors. Specifically, we introduce a Laplacian multi-level separation architecture to split an ODI into different latitude bands, and hierarchically upscale them with different factors. In addition, we propose a deep reinforcement learning scheme with a latitude adaptive reward, in order to automatically select optimal upscaling factors for different latitude bands. To the best of our knowledge, LAU-Net is the first attempt to consider the latitude difference for ODI super-resolution. Extensive results demonstrate that our LAU-Net significantly advances the super-resolution performance for ODIs. Codes are available at https://github.com/wangh-allen/LAU-Net.
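The core idea behind latitude-adaptive upscaling is easy to illustrate: split the ERP image into horizontal latitude bands and give each band its own upscaling factor, with smaller factors near the poles where ERP pixels are oversampled. The sketch below uses plain bicubic interpolation per band purely for illustration; LAU-Net itself learns the per-band upscaling and selects the factors with reinforcement learning.

```python
import torch
import torch.nn.functional as F

def latitude_adaptive_upscale(erp, band_factors):
    """erp: (B, C, H, W) ERP image; band_factors: one upscaling factor per
    equal-height latitude band, ordered from the north pole to the south pole."""
    n = len(band_factors)
    band_h = erp.shape[2] // n
    bands = []
    for i, s in enumerate(band_factors):
        band = erp[:, :, i * band_h:(i + 1) * band_h, :]
        bands.append(F.interpolate(band, scale_factor=s, mode='bicubic',
                                   align_corners=False))
    return bands                        # bands now carry different resolutions

x = torch.randn(1, 3, 256, 512)
out = latitude_adaptive_upscale(x, [2, 4, 4, 2])   # polar bands get a smaller factor
```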
An omnidirectional image (ODI) enables viewers to look in every direction from a fixed point through a head-mounted display providing an immersive experience compared to that of a standard image. Designing immersive virtual reality systems with ODIs is challenging as they require high resolution content. In this paper, we study super-resolution for ODIs and propose an improved generative adversarial network based model which is optimized to handle the artifacts obtained in the spherical observational space. Specifically, we propose to use a fast PatchGAN discriminator, as it needs fewer parameters and improves the super-resolution at a fine scale. We also explore the generative models with adversarial learning by introducing a spherical-content specific loss function, called 360-SS. To train and test the performance of our proposed model we prepare a dataset of 4500 ODIs. Our results demonstrate the efficacy of the proposed method and identify new challenges in ODI super-resolution for future investigations.
Although on-device video super-resolution enables high-quality live 360-degree streaming on mobile devices, existing methods often waste energy by overlooking perceived visual quality. In this paper, we present EOS, an energy-efficient on-device super-resolution system for mobile omnidirectional video (ODV) live streaming. EOS reduces energy waste by dynamically adjusting super-resolution complexity based on the predicted visual quality of super-resolved frames. This approach raises two challenges: (1) designing an adaptive inference policy that maximizes energy savings while minimizing degradation in Quality-of-Experience (QoE), and (2) developing a method to predict visual quality under the constraints of mobile ODV live streaming. To tackle these challenges, EOS introduces EOS SR and a No-Reference Up-scaling Quality Prediction scheme. EOS SR employs a device-agnostic, scalable deep neural network optimized for mobile devices, with an energy-aware scheduler that jointly selects the optimal super-resolution model and GPU frequency. The No-Reference Upscaling Quality Prediction scheme estimates visual quality across arbitrary viewpoints in real time without requiring high-resolution reference videos. Experiments on commodity smartphones show that EOS reduces average power consumption by 34.6%–49.9% compared to baseline methods, while preserving high visual quality and frame rates.
Omnidirectional images (ODIs) serve as a fundamental visual medium for presenting virtual reality (VR) contents, supporting fully immersive experiences through 360-degree scene representation. Typically, a high pixel density is essential for visual quality in VR environments, which in turn requires sufficiently high-resolution imagery to achieve. However, capturing native high-resolution ODIs requires expensive omnidirectional cameras with large sensors (e.g., Insta360 TITAN). An alternative approach is to use low-resolution cameras to acquire original images and then enhance their resolution via super-resolution algorithms. In this work, we explore whether super-resolution ODIs can be easily distinguished from native high-resolution ODIs at 8K scale. To this end, we first construct the Native Resolution Assessment of 8K Omnidirectional Images (NRA-8KODI) dataset, whose native 8K ODIs are collected with an Insta360 TITAN camera and whose 8K super-resolution images are generated by SOTA open-source algorithms. Recognizing that high-frequency signals are essential for differentiating non-native 8K ODIs, a frequency-aware model is designed to capture high-frequency details. Specifically, to preserve the high-frequency details present at high resolutions while reducing the computational costs they bring, we propose a frequency-aware compressor module to suppress feature channels dominated by low-frequency details. Finally, our model achieves 97.2% accuracy in detecting non-native 8K ODIs, implying that super-resolution for ODIs can still be improved for visual experience in VR applications.
Omnidirectional images (ODIs) demand considerably higher resolution to ensure high quality across all viewports. Traditional convolutional neural networks (CNN)-based single-image super-resolution (SISR) networks, however, are not effective for spherical ODIs. This is due to the uneven pixel density distribution and varying texture complexity in different regions that arise when projecting from a sphere to a plane. Additionally, the computational and memory costs associated with large-sized ODIs present a challenge for real-world application. To address these issues, we propose an efficient distortion-adaptive super-resolution network (ODA-SRN). Specifically, ODA-SRN employs a series of specially designed Distortion Attention Block Groups (DABG) as its backbone. Our Distortion Attention Blocks (DABs) utilize multi-segment parameterized convolution to generate dynamic filters, which compensate for distortion and texture fading during feature extraction. Moreover, we introduce an upsampling scheme that accounts for the dependence of pixel position and distortion degree to achieve pixel-level distortion offset. A comprehensive set of results demonstrates that our ODA-SRN significantly improves the super-resolution performance for ODIs, both quantitatively and qualitatively, when compared to other state-of-the-art methods.
The live streaming of omnidirectional video (ODV) on mobile devices demands considerable network resources; thus, current mobile networks are incapable of providing users with high-quality ODV equivalent to conventional flat videos. We observe that mobile devices, in fact, underutilize graphics processing units (GPUs) while processing ODVs; hence, we envisage an opportunity exists in exploiting video super-resolution (VSR) for improved ODV quality. However, the device-specific discrepancy in GPU capability and dynamic behavior of GPU frequency in mobile devices create a challenge in providing VSR-enhanced ODV streaming. In this paper, we propose OmniLive, an on-device VSR system for mobile ODV live streaming. OmniLive addresses the dynamicity of GPU capability with an anytime inference-based VSR technique called Omni SR. For Omni SR, we design a VSR deep neural network (DNN) model with multiple exits and an inference scheduler that decides on the exit of the model at runtime. OmniLive also solves the performance heterogeneity of mobile GPUs using the Omni neural architecture search (NAS) scheme. Omni NAS finds an appropriate DNN model for each mobile device with Omni SR-specific neural architecture search techniques. We implemented OmniLive as a fully functioning system encompassing a streaming server and Android application. The experiment results show that our anytime VSR model provides four times upscaled videos while saving up to 57.15% of inference time compared with the previous super-resolution model showing the lowest inference time on mobile devices. Moreover, OmniLive can maintain 30 frames per second while fully utilizing GPUs on various mobile devices.
In this paper, we propose a distortion-aware loop filtering model to improve the performance of intra coding for 360° videos projected via the equirectangular projection (ERP) format. To enable the awareness of distortion, our proposed module analyzes content characteristics based on a coding unit (CU) partition mask and processes them through partial convolution to activate the specified area. The feature recalibration module, which leverages cascaded residual channel-wise attention blocks (RCABs) to adjust the inter-channel and intra-channel features automatically, is capable of adapting to different quality levels. The perceptual geometry optimization, combining weighted mean squared error (WMSE) and a perceptual loss, guarantees both the local field of view (FoV) and global image reconstruction with high quality. Extensive experimental results show that our proposed scheme achieves significant bitrate savings compared with the anchor (HM + 360Lib), leading to average bit rate reductions of 8.9%, 9.0%, 7.1% and 7.4% in terms of PSNR, WPSNR, and the PSNR of two viewports for the luminance component of 360° videos, respectively.
Due to the wide use of social media and their large field of view, 360-degree images are very popular among all categories of content creators as well as researchers in areas such as virtual reality, surveillance, and robotics. These images provide a large amount of data that can be captured commercially with a single-shot omnidirectional camera. As the human visual system is not adapted to such omnidirectional images, unwrapping them becomes essential. The transformed images, called equirectangular images, exhibit large distortions; to minimize the effect of distortion, an Enhanced Equirectangular Projection unwrapping algorithm based on a nonlinear stretching method is proposed. The experimental results show that the Enhanced Equirectangular Projection method gives better results than the standard ERP method.
This paper discusses the methods used for projecting 360-degree video onto a two-dimensional plane to reduce the transmission bandwidth. The two primary methods examined are Equirectangular Projection and Cube Projection. Equirectangular Projection is known for its oversampling problem at high latitudes, leading to uneven pixel distribution. To address this, a region adaptive smoothing technology is introduced to optimize video quality and save bandwidth. Cube Projection, which maps spherical video onto the faces of a cube, is analyzed for its ability to maintain pixel uniformity and reduce geometric distortion. The paper further discusses the solutions for correcting motion compensation errors in Cube Projection, emphasizing the importance of multi-facet and cube face extensions in eliminating geometric distortions.
Depth estimation from a monocular 360 image is an emerging problem that gains popularity due to the availability of consumer-level 360 cameras and the complete surrounding sensing capability. While the standard of 360 imaging is under rapid development, we propose to predict the depth map of a monocular 360 image by mimicking both peripheral and foveal vision of the human eye. To this end, we adopt a two-branch neural network leveraging two common projections: equirectangular and cubemap projections. In particular, equirectangular projection incorporates a complete field-of-view but introduces distortion, whereas cubemap projection avoids distortion but introduces discontinuity at the boundary of the cube. Thus we propose a bi-projection fusion scheme along with learnable masks to balance the feature map from the two projections. Moreover, for the cubemap projection, we propose a spherical padding procedure which mitigates discontinuity at the boundary of each face. We apply our method to four panorama datasets and show favorable results against the existing state-of-the-art methods.
In contrast with traditional images, omnidirectional image (OI) has a higher resolution and provides the user with an interactive wide field of view. OI with equirectangular projection (ERP) format, as the default for encoding and transmitting omnidirectional visual contents, is not suitable for quality assessment of OI because of serious geometric distortion in the bipolar regions, especially for blind image quality assessment. In this paper, a segmented spherical projection (SSP) based blind omnidirectional image quality assessment (SSP-BOIQA) method is proposed. The OI with ERP format is first converted into that with SSP format, so as to solve the problem of stretching distortion in the bipolar regions of ERP format, but retain the equatorial region of ERP format. On the one hand, considering that the bipolar regions of the SSP format are circular, a local/global perceptual features extraction scheme with fan-shaped window is proposed for estimating the distortion in the bipolar regions of OI. On the other hand, the perceptual features of the equatorial region are extracted with heat map as weighting factor to reflect users’ visual behavior. Then, the features extracted from the OI’s bipolar and equatorial regions are pooled to predict the quality of distorted OIs. The experiments on two databases, namely CVIQD2018 and MVAQD databases, demonstrate that the proposed SSP-BOIQA method outperforms the state-of-the-art blind quality assessment methods, and is more consistent with human visual perception.
Compared to 2D perspective images, panoramic images capture a larger field-of-view (FOV). Depth estimation from panoramas is an important task for 3D scene understanding and has made significant progress with the development of CNNs. However, existing CNN-based methods still suffer from the Equirectangular Projection (ERP) problem to deal with panoramic distortions (e.g. same receptive fields near the equator and the two poles) and have difficulty generating accurate depth boundaries. In contrast to existing CNN-based methods, in this paper, a novel Transformer-based method is proposed which is able to cope with panoramic distortions and to generate accurate depth boundaries. A Distortion-aware Transformer is designed using a yaw-invariant cycle shift and a distortion-guided partitioning. The aim is to alleviate the distortion effect by enlarging the receptive fields in both horizontal and vertical directions. Then, a Gradient Transformer is proposed to enhance the features around the boundaries. Gradient information is adopted as a boundary prior. Large-scale experimental results show an improvement compared to state-of-the-art methods. Our method also shows strong generalization capabilities. Finally, our method is extended to panorama semantic segmentation.
Image-based salient object detection (SOD) has been extensively explored in the past decades. However, SOD on 360° omnidirectional images is less studied owing to the lack of datasets with pixel-level annotations. Toward this end, this paper proposes a 360° image-based SOD dataset that contains 500 high-resolution equirectangular images. We collect the representative equirectangular images from five mainstream 360° video datasets and manually annotate all objects and regions over these images with precise masks in a free-viewpoint way. To the best of our knowledge, it is the first publicly available dataset for salient object detection on 360° scenes. By observing this dataset, we find that distortion from projection, large-scale complex scenes and small salient objects are the most prominent characteristics. Inspired by these findings, this paper proposes a baseline model for SOD on equirectangular images. In the proposed approach, we construct a distortion-adaptive module to deal with the distortion caused by the equirectangular projection. In addition, a multi-scale contextual integration block is introduced to perceive and distinguish the rich scenes and objects in omnidirectional scenes. The whole network is organized in a progressive manner with deep supervision. Experimental results show the proposed baseline approach outperforms the top-performing state-of-the-art methods on the 360° SOD dataset. Moreover, benchmarking results of the proposed baseline approach and other methods on the 360° SOD dataset show that the proposed dataset is very challenging, which also validates the usefulness of the proposed dataset and approach in boosting the development of SOD on 360° omnidirectional scenes.
Efficient compression of omnidirectional video is important for emerging virtual reality applications. To compress this kind of video, each frame is projected to a 2D plane [e.g., equirectangular projection (ERP) map] first, adapting to the input format of existing video coding systems. At the display side, an inverse projection is applied to the reconstructed video to restore signals in the spherical domain. Such a projection, however, places presentation and encoding in different domains. Thus, an encoder agnostic to the projection performs inefficiently. In this paper, we analyze how a projection influences the distortion measurements in different domains. Based on the analysis, we propose a scheme to optimize the encoding process based on the signals' distortion in the spherical domain. With the proposed optimization, an average 4.31% (up to 9.67%) luma BD-rate reduction is achieved for ERP in the random access configuration. The corresponding bit saving averages 10.84% (up to 34.44%) when considering a viewing field of π/2. The proposed method also benefits other projections and viewport settings, with a marginal complexity increase.
No abstract available
The polyhedron projection for 360-degree video is becoming more and more popular since it can lead to much less geometry distortion compared with the equirectangular projection. However, in the polyhedron projection, we can observe very obvious texture discontinuity in the area near the face boundary. Such a texture discontinuity may lead to serious quality degradation when motion compensation crosses the discontinuous face boundary. To solve this problem, in this paper, we first propose to fill the corresponding neighboring faces in the suitable positions as the extension of the current face to keep approximated texture continuity. Then a co-projection-plane based 3-D padding method is proposed to project the reference pixels in the neighboring face to the current face to guarantee exact texture continuity. Under the proposed scheme, the reference pixel is always projected to the same plane with the current pixel when performing motion compensation so that the texture discontinuity problem can be solved. The proposed scheme is implemented in the reference software of High Efficiency Video Coding. Compared with the existing method, the proposed algorithm can significantly improve the rate-distortion performance. The experimental results obviously demonstrate that the texture discontinuity in the face boundary can be well handled by the proposed algorithm.
This paper presents an efficient method for encoding common projection formats in 360° video coding, in which we exploit inactive regions. These regions are ignored in the reconstruction of the equirectangular format or the viewport in virtual reality applications. As the content of these pixels is irrelevant, we neglect the corresponding pixel values in rate-distortion optimization, residual transformation, as well as in-loop filtering and achieve bitrate savings of up to 10%.
No abstract available
This is a technical report on the 360-degree panoramic image generation task based on diffusion models. Unlike ordinary 2D images, 360-degree panoramic images capture the entire 360°×180° field of view, so the rightmost and leftmost sides of a 360 panoramic image should be continuous, which is the main challenge in this field. However, the current diffusion pipeline is not appropriate for generating such a seamless 360-degree panoramic image. To this end, we propose a circular blending strategy on both the denoising and VAE decoding stages to maintain geometric continuity. Based on this, we present two models for the Text-to-360-panoramas and Single-Image-to-360-panoramas tasks. The code has been released as an open-source project at https://github.com/ArcherFMY/SD-T2I-360PanoImage and on ModelScope (https://www.modelscope.cn/models/damo/cv_diffusion_text-to-360panorama-image_generation/summary).
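The seam constraint above comes from the fact that the leftmost and rightmost columns of a panorama are physically adjacent once the image is wrapped. The sketch below is an assumed, simplified illustration of a circular blend (the released code linked above implements the actual strategy on the denoising and VAE decoding stages): it feathers the two borders of a tensor toward their mutual average so the wrapped ends match.

```python
import torch

def circular_blend(x, blend_width=16):
    """Feather the two panorama borders toward their mutual average so that,
    when the image is wrapped, the last column matches the first one, while
    columns far from the seam are left untouched. x: (B, C, H, W)."""
    w = blend_width
    out = x.clone()
    for j in range(w):
        t = (w - j) / float(w)                 # 1 at the border, -> 0 inward
        left, right = x[..., j], x[..., -1 - j]
        mean = 0.5 * (left + right)
        out[..., j] = (1 - t) * left + t * mean
        out[..., -1 - j] = (1 - t) * right + t * mean
    return out

latent = torch.randn(1, 4, 64, 128)            # e.g. a diffusion latent
seamless = circular_blend(latent)
```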
The continuous development of virtual reality animation has brought people a new viewing experience. However, there is still a large research space for the construction of virtual scenes. Underwater scenes are complex and diverse, and to obtain more realistic virtual scenes, it is necessary to use panoramic video images as references for modeling in advance. To this end, the study uses the K-means clustering method to extract key frames from underwater video, and adaptively adjusts the number of clusters to improve the extraction algorithm according to the differences in features. To address the problems of low contrast and severe blurring in underwater images, the study uses an improved non-local prior recovery method to achieve the recovery process of underwater images. Finally, the underwater panoramic image is obtained through a fade-out image fusion and frame-stitching image synthesis strategy. The experimental analysis shows that the runtime of Model 1 is 21.46 s, the root mean square error value is 1.89, the structural similarity value is 0.9678, and the average gradient value is 12.59. It can achieve efficient and high-quality panoramic image generation.
No abstract available
Image stitching technology aims to generate a Stitched Panoramic Image (SPI) by stitching multiple narrow-view images containing overlapping regions. However, heterogeneous artifacts may be introduced during the process of image stitching, which affects the visual perception. To automatically and accurately evaluate the perceptual quality of SPIs, a novel local visual and global deep features based blind SPI quality evaluation method is proposed in this paper. To be specific, with the consideration that image stitching mainly destroys structure, texture and color information, we first design a color structure-texture joint dictionary trained from the constructed stitched-specific image patches dataset. Given an input SPI, its local visual and global deep features are extracted to characterize the stitched-specific distortions. For local visual features, the trained dictionary is employed to capture structure, texture and color distortions by sparse features extraction. Then, considering that sparse features are insensitive to weak structural distortions, weighted local binary pattern features are extracted to measure various weak distortions. For global perceptual features, deep features are extracted via a pre-trained convolutional neural network model to represent the high-level semantics. Finally, considering the diversity of the extracted features, an ensemble learning strategy is adopted to promote the generalization performance and prediction accuracy of the proposed model. Experimental results show that, compared with the conventional 2D and SPI quality measurement methods, the proposed method can measure the stitched-specific distortions more accurately, and is more consistent with subjective ratings.
Image outpainting gains increasing attention since it can generate the complete scene from a partial view, providing a valuable solution to construct 360° panoramic images. As image outpainting suffers from the intrinsic issue of unidirectional completion flow, previous methods convert the original problem into inpainting, which allows a bidirectional flow. However, we find that inpainting has its own limitations and is inferior to outpainting in certain situations. The question of how they may be combined for the best of both has as yet remained under-explored. In this paper, we provide a deep analysis of the differences between inpainting and outpainting, which essentially depends on how the source pixels contribute to the unknown regions under different spatial arrangements. Motivated by this analysis, we present a Cylin-Painting framework that involves meaningful collaborations between inpainting and outpainting and efficiently fuses the different arrangements, with a view to leveraging their complementary benefits on a seamless cylinder. Nevertheless, straightforwardly applying the cylinder-style convolution often generates visually unpleasing results as it discards important positional information. To address this issue, we further present a learnable positional embedding strategy to incorporate the missing component of positional encoding into the cylinder convolution, which significantly improves the panoramic results. It is noted that while developed for image outpainting, the proposed algorithm can be effectively extended to other panoramic vision tasks, such as object detection, depth estimation, and image super-resolution. Code will be made available at https://github.com/KangLiao929/Cylin-Painting.
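The cylinder-style convolution discussed above treats the left and right borders of a panorama as adjacent. A minimal sketch using circular horizontal padding is shown below; the learnable positional embedding that the paper adds on top is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CylinderConv2d(nn.Module):
    """Convolution on a horizontally wrapped panorama: the left/right borders
    are padded circularly (they are physically adjacent), while the top/bottom
    are zero-padded as usual."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.k = k
        self.conv = nn.Conv2d(in_ch, out_ch, k)        # no built-in padding

    def forward(self, x):
        p = self.k // 2
        x = F.pad(x, (p, p, 0, 0), mode='circular')    # wrap horizontally
        x = F.pad(x, (0, 0, p, p))                     # zero-pad vertically
        return self.conv(x)

x = torch.randn(1, 3, 128, 256)
y = CylinderConv2d(3, 16)(x)                           # -> (1, 16, 128, 256)
```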
360-degree video is an emerging form of media that encodes information about all directions surrounding a camera, offering an immersive experience to the users. Unlike traditional 2D videos, visual information in 360-degree videos can be naturally represented as pixels on a sphere. Inspired by state-of-the-art deep-learning-based 2D image super-resolution models and spherical CNNs, in this article, we design a novel spherical super-resolution (SSR) approach for 360-degree videos. To support viewport-adaptive and bandwidth-efficient transmission/streaming of 360-degree video data and save computation, we propose the Focused Icosahedral Mesh to represent a small area on the sphere. We further construct matrices to rotate spherical content over the entire sphere to the focused mesh area, allowing us to use the focused mesh to represent any area on the sphere. Motivated by the PixelShuffle operation for 2D super-resolution, we also propose a novel VertexShuffle operation on the mesh and an improved version VertexShuffle_V2. We compare our SSR approach with state-of-the-art 2D super-resolution models and show that SSR has the potential to achieve significant benefits when applied to spherical signals.
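For reference, the 2D PixelShuffle operation that VertexShuffle generalizes to mesh vertices can be sketched as follows; this is a generic PyTorch example, not the mesh implementation itself.

```python
import torch
import torch.nn as nn

# PixelShuffle upsampling: a conv expands the channels by r^2, then PixelShuffle
# rearranges those channels into an r-times larger spatial grid.
r = 2
upsample = nn.Sequential(
    nn.Conv2d(64, 64 * r * r, kernel_size=3, padding=1),
    nn.PixelShuffle(r),
)
x = torch.randn(1, 64, 32, 64)         # low-resolution feature map
y = upsample(x)                        # -> (1, 64, 64, 128)
```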
With the recent successes of deep learning models, the performance of 2D image super-resolution has improved significantly. Inspired by recent state-of-the-art 2D super-resolution models and spherical CNNs, in this paper, we design a novel spherical superresolution (SSR) approach for 360-degree videos. To address the bandwidth waste problem associated with 360-degree video transmission/streaming and save computation, we propose the Focused Icosahedral Mesh to represent a small area on the sphere and construct matrices to rotate spherical content to the focused mesh area. We also propose a novel VertexShuffle operation on the mesh, motivated by the 2D PixelShuffle operation. We compare our SSR approach with state-of-the-art 2D super-resolution models. We show that SSR has the potential to achieve significant benefits when applied to spherical signals.
360° videos are getting popular, providing an immersive streaming experience for the user; nevertheless, they demand high bandwidth in mobile networks due to their larger spherical frames. In this preliminary work, we propose to combine frame interpolation and super-resolution methods to optimize tile-based 360° video delivery by streaming tiles at low quality in the network and increasing the quality by leveraging Multi-Access Edge Computing. We propose a mechanism to adaptively decide this quality conversion at the client side, which improves average video quality by 30% and bandwidth saving by 43.3% compared to existing tile-based streaming.
Spherical videos, also known as 360° (panorama) videos, can be viewed with various virtual reality devices such as computers and head-mounted displays. They attract a large amount of interest since strong immersion can be experienced when watching spherical videos. However, capturing, storing and transmitting high-resolution spherical videos are extremely expensive. In this paper, we propose a novel single frame and multi-frame joint network (SMFN) for recovering high-resolution spherical videos from low-resolution inputs. To take advantage of pixel-level inter-frame consistency, deformable convolutions are used to eliminate the motion difference between feature maps of the target frame and its neighboring frames. A mixed attention mechanism is devised to enhance the feature representation capability. A dual learning strategy is employed to constrain the solution space so that a better solution can be found. A novel loss function based on the weighted mean square error is proposed to emphasize the super-resolution of the equatorial regions. This is the first attempt to address the super-resolution of spherical videos, and we collect a novel dataset from the Internet, MiG Panorama Video, which includes 204 videos. Experimental results on 4 representative video clips demonstrate the efficacy of the proposed method. The dataset and code are available at this https URL.
With the emergence of 360-degree images/videos, augmented reality (AR) and virtual reality (VR), the demand for analysing and processing spherical signals has increased tremendously. However, much effort has been spent on planar signals projected from spherical signals, which leads to problems such as wasted pixels and distortion. Recent advances in spherical CNNs have opened up the possibility of directly analysing spherical signals. However, they operate on the full mesh, which makes it infeasible to deal with real-world applications due to the extremely large bandwidth requirement. To address the bandwidth waste problem associated with 360-degree video streaming and save computation, we exploit a Focused Icosahedral Mesh to represent a small area and construct matrices to rotate spherical content to the focused mesh area. We also propose a novel VertexShuffle operation that can significantly improve both the performance and the efficiency compared to the original MeshConv Transpose operation introduced in UGSCNN. We further apply our proposed methods to super-resolution, yielding the first spherical super-resolution model that directly operates on a mesh representation of the spherical pixels of 360-degree data. To evaluate our model, we also collect a set of high-resolution 360-degree videos to generate a spherical image dataset. Our experiments indicate that our proposed spherical super-resolution model achieves significant benefits in terms of both performance and inference time compared to the baseline spherical super-resolution model that uses the simple MeshConv Transpose operation. In summary, our model achieves great super-resolution performance on 360-degree inputs, achieving 32.79 dB PSNR on average when super-resolving to 16× the vertices on the mesh.
Omnidirectional images (ODIs), also known as 360-degree images, enable viewers to explore all directions of a given 360-degree scene from a fixed point. Designing an immersive imaging system with ODIs is challenging as such systems require very large resolution coverage of the entire 360 viewing space to provide an enhanced quality of experience (QoE). Despite remarkable progress on single image super-resolution (SISR) methods with deep-learning techniques, no study on quality assessment of super-resolved ODIs exists to analyze the quality of such SISR techniques. This paper proposes an objective, full-reference quality assessment framework which studies quality measurement for ODIs generated by GAN-based and CNN-based SISR methods. The quality assessment framework utilizes tangential views to cope with the spherical nature of a given ODI. The generated tangential views are distortion-free and can be efficiently scaled to high-resolution spherical data for SISR quality measurement. We extensively evaluate two state-of-the-art SISR methods using widely used full-reference SISR quality metrics adapted to our designed framework. In addition, our study reveals that most objective metrics rate CNN-based SISR higher, while subjective tests favor GAN-based architectures.
As a driving force behind Virtual Reality (VR), 360-degree video streaming enables immersive and interactive experiences that seamlessly connect the physical and digital worlds. However, its bandwidth-intensive nature poses substantial challenges to existing systems. To address the high bandwidth demand associated with streaming 360-degree videos, some researchers have proposed super-resolution-enhanced video streaming systems. However, these solutions often face challenges in uniformly applying to diverse user devices, which exhibit significant variations in bandwidth and computational capabilities. This paper introduces SRA360, an innovative super-resolution enhanced adaptive 360-degree video streaming scheme for heterogeneous viewers. It integrates server, edge, and client layers, utilizing a Meta Proximal Policy Optimization algorithm for adaptive video quality selection and computational task scheduling in heterogeneous environments. A Reconstruction-and-Retransmission-enabled Double Buffer mechanism is then proposed to resolve the conflict between smooth playback and high-precision FoV prediction. Real-world trace-driven experiment results demonstrate the superiority of SRA360 in enhancing quality of experience and reducing bandwidth consumption compared to state-of-the-art methods.
360° video streaming requires considerable bandwidth, and many techniques have been proposed to address this problem. One such technique is super-resolution, where the video is compressed at the server, and the client runs a deep learning model to enhance the video quality. However, most of today’s off-the-shelf mobile devices cannot support super-resolution for all tiles in real time. As a result, some tiles cannot be reconstructed to high resolution, significantly reducing users’ Quality of Experience (QoE). To address this problem, we utilize linear interpolation, which requires much less computational overhead. Through experiments, we observe that interpolation can achieve comparable quality, and even outperform super-resolution for some tiles with low spatial complexity. Building on this, we develop a 360° video streaming system that adaptively selects the most suitable downloading strategy, whether interpolation, super-resolution, or ABR at the appropriate bitrate, for each tile to maximize user QoE while considering network bandwidth limitations and the computational constraints of mobile devices. We formalize the 360° video streaming problem as an optimization problem and propose an efficient algorithm to solve it. Extensive evaluations using real user viewing data and 5G network traces demonstrate that our solution significantly outperforms existing techniques in terms of QoE under various scenarios.
No abstract available
With the rapid development of VR technology, 360-degree video is gradually coming into public view. Compared with normal video, it can bring a better immersive experience to the user at the cost of huge bandwidth and transmission latency. Many studies choose to transmit low-resolution video frames to solve those problems and use super-resolution on the client side to reconstruct the resolution of the whole frame. However, in those methods, the computational latency caused by the super-resolution process, whether applied to a whole video frame or only the viewport portion of a frame, is a critical problem. In this paper, the idea of central-vision-based super-resolution is considered in 360-degree video streaming. In particular, we adapt FOCAS, a super-resolution method using foveated rendering for normal 2D video, to 360-degree video. We conduct an experiment in order to verify the effectiveness of human central vision. The experimental results show that it is able to reduce the computational latency by 82% while maintaining a competitive visual quality compared to existing super-resolution methods.
In this paper, we propose a device energy-efficient two-step adaptive scheme for tile-based 360° video streaming to support enhanced multi-user viewing quality-of-experience (QoE) in a time-slotted system. Specifically, each video chunk is first prefetched based on the predictive field-of-view (FoV), and the FoV quality of the chunk is then enhanced at a closer-to-playback time instant based on the updated FoV prediction with improved accuracy. Both transmission-driven and device video super-resolution (VSR)-driven methods are adaptively selected to enable efficient video chunk enhancement. At each time slot, the incremental QoE gain of each user is characterized via a time-difference approach, based on which the best candidate chunk is determined. Then, a single-slot problem is formulated for maximizing the total incremental QoE gain while minimizing the total device energy consumption. A particle swarm optimization (PSO)-based iterative solution is proposed to obtain optimal bandwidth allocation, bitrate level selection, and enhancement method selection for multiple users. Extensive simulation results demonstrate that our proposed solution outperforms benchmark schemes in terms of average viewing QoE, average device energy consumption, and average utility.
No abstract available
360° videos provide an immersive experience to users, but require considerably more bandwidth to stream compared to regular videos. State-of-the-art 360° video streaming systems use viewport prediction to reduce the bandwidth requirement, which involves predicting which part of the video the user will view and only fetching that content. However, viewport prediction is error-prone, resulting in poor user Quality of Experience (QoE). We design PARSEC, a 360° video streaming system that reduces bandwidth requirement while improving video quality. PARSEC trades off bandwidth for additional client-side computation to achieve its goals. PARSEC uses an approach based on super-resolution, where the video is significantly compressed at the server and the client runs a deep learning model to enhance the video to a much higher quality. PARSEC addresses a set of challenges associated with using super-resolution for 360° video streaming: large deep learning models, slow inference rate, and variance in the quality of the enhanced videos. To this end, PARSEC trains small micro-models over shorter video segments, and then combines traditional video encoding with super-resolution techniques to overcome the challenges. We evaluate PARSEC on a real WiFi network, over a broadband network trace released by FCC, and over a 4G/LTE network trace. PARSEC significantly outperforms the state-of-the-art 360° video streaming systems while reducing the bandwidth requirement.
360-degree video has shown great potential for mainstream adoption due to its immersive experience. However, 360-degree video streaming requires ultrahigh bandwidth and low latency, which limits the improvement of user quality of experience (QoE). Currently, methods combining field of view (FoV) prediction and adaptive video streaming provide an effective way to address the above issues. However, existing FoV prediction methods based on recurrent neural networks (RNN) cannot capture long-range dependency from input to output. Current deep reinforcement learning (DRL)-based adaptive strategies fail to estimate the future bandwidth with high accuracy and fully explore the capability of VR devices. To ameliorate these limitations, we design a DRL-based 360-degree video streaming method named VRFormer that combines FoV prediction and super-resolution (SR). First, we adopt a content-aware transformer-based encoder-decoder network to make the long-term FoV prediction. It combines the user's head movement history, eye-tracking history, and user attention extracted from a convolutional neural network (CNN)-based network. Second, we introduce a DNN-based SR network running on VR devices to reconstruct high-definition video content. Finally, we apply a DRL-based network to adaptively allocate rates for future tiles and dynamically control video content reconstruction. Experiments have verified that the proposed method can effectively improve the user's viewing QoE compared to state-of-the-art methods.
The restriction of network resources has forced cloud Virtual Reality service providers to only transmit low-resolution 360-degree images to Virtual Reality devices, leading to an unpleasant user experience. Deep learning-based single image super-resolution approaches are commonly used for transforming low-resolution images into high-resolution versions, but these approaches are unable to deal with a dataset which has an extremely low number of training image samples. Moreover, current single image training models cannot deal with 360-degree images with very large image sizes. Therefore, we propose a 360-degree image super-resolution method which can train a super-resolution model on a single 360-degree image sample by using image patching techniques and a generative adversarial network. We also propose an improved Generative Adversarial Network (GAN) model structure named Progressive Residual GAN (PRGAN), which learns the image in a coarse-to-fine way using progressively growing residual blocks and preserves structural and textural information with multi-level skip connections. Experiments on a street view panorama image dataset prove that our image super-resolution method outperforms several baseline methods in multiple image quality evaluation metrics, while keeping the generator model computationally efficient.
360-degree videos have gained increasing popularity due to their capability to provide users with an immersive viewing experience. Given the limited network bandwidth, it is a common approach to only stream video tiles in the user's Field-of-View (FoV) with high quality. However, it is difficult to perform accurate FoV prediction due to diverse user behaviors and time-varying network conditions. In this paper, we re-design 360-degree video streaming systems by leveraging the technique of super-resolution (SR). The basic idea of our proposed SR360 framework is to trade abundant computation resources on the user devices for a reduction in network bandwidth. In the SR360 framework, a video tile with low resolution can be boosted to a video tile with high resolution using SR techniques at the client side. We adopt the theory of deep reinforcement learning (DRL) to make a set of decisions jointly, including user FoV prediction, bitrate allocation and SR enhancement. By conducting extensive trace-driven evaluations, we compare the performance of our proposed SR360 with other state-of-the-art methods and the results show that SR360 significantly outperforms other methods by at least 30% on average under different QoE metrics.
In recent years, streamed 360° videos have gained popularity within Virtual Reality (VR) and Augmented Reality (AR) applications. However, they are of much higher resolutions than 2D videos, causing greater bandwidth consumption when streamed. This increased bandwidth utilization puts tremendous strain on the network capacity of the cloud providers streaming these videos. In this paper, we introduce L3BOU, a novel, three-tier distributed software framework that reduces cloud-edge bandwidth in the backhaul network and lowers average end-to-end latency for 360° video streaming applications. The L3BOU framework achieves low bandwidth and low latency by leveraging edge-based, optimized upscaling techniques. L3BOU accomplishes this by utilizing down-scaled MPEG-DASH-encoded 360° video data, known as Ultra Low Resolution (ULR) data, that the L3BOU edge applies distributed super-resolution (SR) techniques on, providing a high quality video to the client. L3BOU is able to reduce the cloud-edge backhaul bandwidth by up to a factor of 24, and the optimized super-resolution multi-processing of ULR data provides a 10-fold latency decrease in super resolution upscaling at the edge.
Deep convolutional neural networks (CNNs) have been widely applied for low-level vision over the past five years. Depending on the nature of different applications, appropriate CNN architectures have been designed. However, customized architectures gather different features by treating all pixels as equal to improve the performance of a given application, which ignores the effect of locally salient pixels and results in low training efficiency. In this paper, we propose an asymmetric CNN (ACNet) comprising an asymmetric block (AB), a memory enhancement block (MEB) and a high-frequency feature enhancement block (HFFEB) for image super-resolution. The AB utilizes one-dimensional asymmetric convolutions to intensify the square convolution kernels in horizontal and vertical directions, promoting the influence of local salient features for SISR. The MEB fuses all hierarchical low-frequency features from the AB via a residual learning (RL) technique to resolve the long-term dependency problem and transforms the obtained low-frequency features into high-frequency features. The HFFEB exploits low- and high-frequency features to obtain more robust super-resolution features and to address the excessive feature enhancement problem. Additionally, it also takes charge of reconstructing a high-resolution (HR) image. Extensive experiments show that our ACNet can effectively address single image super-resolution (SISR), blind SISR, and blind SISR with unknown noise. The code of ACNet is available at https://github.com/hellloxiaotian/ACNet.
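To make the asymmetric-convolution idea above concrete, the following is a minimal PyTorch sketch: a square 3×3 convolution reinforced by parallel 1×3 and 3×1 convolutions whose outputs are summed. This is an illustrative block under assumed names and sizes, not the authors' exact ACNet implementation.

```python
import torch
import torch.nn as nn

class AsymmetricBlock(nn.Module):
    """Illustrative asymmetric block: a square 3x3 conv strengthened by
    parallel 1x3 and 3x1 convolutions whose outputs are summed."""
    def __init__(self, channels: int):
        super().__init__()
        self.square = nn.Conv2d(channels, channels, 3, padding=1)
        self.horizontal = nn.Conv2d(channels, channels, (1, 3), padding=(0, 1))
        self.vertical = nn.Conv2d(channels, channels, (3, 1), padding=(1, 0))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # The 1D branches emphasize horizontal and vertical local structure.
        return self.act(self.square(x) + self.horizontal(x) + self.vertical(x))

x = torch.randn(1, 64, 32, 32)
y = AsymmetricBlock(64)(x)  # shape preserved: (1, 64, 32, 32)
```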
Recently, methods based on implicit neural representations have shown excellent capabilities for arbitrary-scale super-resolution (ASSR). Although these methods represent the features of an image by generating latent codes, these latent codes are difficult to adapt to different magnification factors of super-resolution, which seriously affects their performance. Addressing this, we design the Multi-Scale Implicit Transformer (MSIT), consisting of a Multi-Scale Neural Operator (MSNO) and Multi-Scale Self-Attention (MSSA). MSNO obtains multi-scale latent codes through feature enhancement, multi-scale characteristics extraction, and multi-scale characteristics merging. MSSA further enhances the multi-scale characteristics of latent codes, resulting in better performance. Furthermore, to improve the network's performance, we propose the Re-Interaction Module (RIM) combined with a cumulative training strategy to increase the diversity of information learned by the network. We systematically introduce multi-scale characteristics into ASSR for the first time; extensive experiments validate the effectiveness of MSIT, and our method achieves state-of-the-art performance on arbitrary-scale super-resolution tasks.
This paper reviews the NTIRE2021 challenge on burst super-resolution. Given a RAW noisy burst as input, the task in the challenge was to generate a clean RGB image with 4 times higher resolution. The challenge contained two tracks: Track 1 evaluated on synthetically generated data, and Track 2 used real-world bursts from a mobile camera. In the final testing phase, 6 teams submitted results using a diverse set of solutions. The top-performing methods set a new state-of-the-art for the burst super-resolution task.
Benefiting from deep learning, image super-resolution has become one of the fastest-developing research fields in computer vision. Depending on whether a discriminator is used, a deep convolutional neural network can provide an image with high fidelity or with better perceptual quality. Due to the lack of ground truth images in real life, people prefer a photo-realistic image with low fidelity to a blurry image with high fidelity. In this paper, we revisit the classic example-based image super-resolution approaches and come up with a novel generative model for perceptual image super-resolution. Given that real images contain various noise and artifacts, we propose a joint image denoising and super-resolution model via a Variational AutoEncoder. We design a conditional variational autoencoder to encode the reference into a dense feature vector, which can then be transferred to the decoder for target image denoising. With the aid of the discriminator, an additional super-resolution subnetwork is attached to super-resolve the denoised image with photo-realistic visual quality. We participated in the NTIRE2020 Real Image Super-Resolution Challenge. Experimental results show that by using the proposed approach, we can obtain enlarged images with clean and pleasant features compared to other supervised methods. We also compared our approach with state-of-the-art methods on various datasets to demonstrate the efficiency of our proposed unsupervised super-resolution model.
Recently, Transformer-based methods have achieved impressive results in single image super-resolution (SISR). However, the lack of a locality mechanism and high complexity limit their application in the field of super-resolution (SR). To solve these problems, we propose a new method, the Efficient Mixed Transformer (EMT), in this study. Specifically, we propose the Mixed Transformer Block (MTB), consisting of multiple consecutive transformer layers, in some of which the Pixel Mixer (PM) is used to replace Self-Attention (SA). PM can enhance local knowledge aggregation with pixel shifting operations. At the same time, no additional complexity is introduced as PM has no parameters and floating-point operations. Moreover, we employ a striped-window SA (SWSA) to gain efficient global dependency modelling by utilizing image anisotropy. Experimental results show that EMT outperforms existing methods on benchmark datasets and achieves state-of-the-art performance. The code is available at https://github.com/Fried-Rice-Lab/FriedRiceLab.
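The parameter-free Pixel Mixer can be pictured as channel-grouped pixel shifting. The sketch below is an illustrative PyTorch version under assumed settings (four shift directions, one-pixel shift), not the released EMT code:

```python
import torch

def pixel_mixer(x: torch.Tensor) -> torch.Tensor:
    """Parameter-free local mixing: split channels into groups and shift each
    group by one pixel in a different direction (illustrative version)."""
    b, c, h, w = x.shape
    g = c // 4
    out = x.clone()
    out[:, 0*g:1*g] = torch.roll(x[:, 0*g:1*g], shifts=1, dims=2)   # shift down
    out[:, 1*g:2*g] = torch.roll(x[:, 1*g:2*g], shifts=-1, dims=2)  # shift up
    out[:, 2*g:3*g] = torch.roll(x[:, 2*g:3*g], shifts=1, dims=3)   # shift right
    out[:, 3*g:4*g] = torch.roll(x[:, 3*g:4*g], shifts=-1, dims=3)  # shift left
    return out  # remaining channels (if any) are left unshifted

y = pixel_mixer(torch.randn(1, 32, 16, 16))
```

Because shifting only reindexes existing values, it adds no parameters or FLOPs while letting subsequent point-wise layers see neighboring-pixel information.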
Super-Resolution (SR) is a fundamental computer vision task that aims to obtain a high-resolution clean image from the given low-resolution counterpart. This paper reviews the NTIRE 2021 Challenge on Video Super-Resolution. We present evaluation results from two competition tracks as well as the proposed solutions. Track 1 aims to develop conventional video SR methods focusing on the restoration quality. Track 2 assumes a more challenging environment with lower frame rates, casting a spatio-temporal SR problem. In the two competitions, 247 and 223 participants registered, respectively. During the final testing phase, 14 teams competed in each track to achieve state-of-the-art performance on video SR tasks.
With the emergence of image super-resolution (SR) algorithms, how to blindly evaluate the quality of super-resolution images has become an urgent task. However, existing blind SR image quality assessment (IQA) metrics merely focus on visual characteristics of super-resolution images, ignoring the available scale information. In this paper, we reveal that the scale factor has a statistically significant impact on subjective quality scores of SR images, indicating that the scale information can be used to guide the task of blind SR IQA. Motivated by this, we propose a scale guided hypernetwork framework that evaluates SR image quality in a scale-adaptive manner. Specifically, the blind SR IQA procedure is divided into three stages, i.e., content perception, evaluation rule generation, and quality prediction. After content perception, a hypernetwork generates the evaluation rule used in quality prediction based on the scale factor of the SR image. We apply the proposed scale guided hypernetwork framework to existing representative blind IQA metrics, and experimental results show that the proposed framework not only boosts the performance of these IQA metrics but also enhances their generalization abilities. Source code will be available at https://github.com/JunFu1995/SGH.
Deep neural networks have exhibited remarkable performance in image super-resolution (SR) tasks by learning a mapping from low-resolution (LR) images to high-resolution (HR) images. However, the SR problem is typically an ill-posed problem and existing methods would come with several limitations. First, the possible mapping space of SR can be extremely large since there may exist many different HR images that can be super-resolved from the same LR image. As a result, it is hard to directly learn a promising SR mapping from such a large space. Second, it is often inevitable to develop very large models with extremely high computational cost to yield promising SR performance. In practice, one can use model compression techniques to obtain compact models by reducing model redundancy. Nevertheless, it is hard for existing model compression methods to accurately identify the redundant components due to the extremely large SR mapping space. To alleviate the first challenge, we propose a dual regression learning scheme to reduce the space of possible SR mappings. Specifically, in addition to the mapping from LR to HR images, we learn an additional dual regression mapping to estimate the downsampling kernel and reconstruct LR images. In this way, the dual mapping acts as a constraint to reduce the space of possible mappings. To address the second challenge, we propose a dual regression compression (DRC) method to reduce model redundancy in both layer-level and channel-level based on channel pruning. Specifically, we first develop a channel number search method that minimizes the dual regression loss to determine the redundancy of each layer. Given the searched channel numbers, we further exploit the dual regression manner to evaluate the importance of channels and prune the redundant ones. Extensive experiments show the effectiveness of our method in obtaining accurate and efficient SR models.
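The dual regression constraint described above can be summarized as a two-term loss: a primal term between the SR output and the HR target, and a dual term between the LR image reconstructed from the SR output and the original LR input. A minimal PyTorch sketch, with `dual_net` as a hypothetical small downsampling network and an assumed weighting factor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def dual_regression_loss(sr, hr, lr, dual_net, lam=0.1):
    """Primal L1 loss on the SR output plus a dual L1 loss that maps the SR
    output back to LR space and compares it with the original LR input."""
    primal = F.l1_loss(sr, hr)
    dual = F.l1_loss(dual_net(sr), lr)   # the dual mapping acts as a constraint
    return primal + lam * dual

# Hypothetical dual network: a small learned 4x downsampler.
dual_net = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 3, 3, stride=2, padding=1))

sr, hr = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
lr = torch.rand(1, 3, 16, 16)
loss = dual_regression_loss(sr, hr, lr, dual_net)
```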
In recent years, Vision Transformer-based approaches for low-level vision tasks have achieved widespread success. Unlike CNN-based models, Transformers are more adept at capturing long-range dependencies, enabling the reconstruction of images utilizing non-local information. In the domain of super-resolution, Swin-transformer-based models have become mainstream due to their capability of global spatial information modeling and their shifting-window attention mechanism that facilitates the interchange of information between different windows. Many researchers have enhanced model performance by expanding the receptive fields or designing meticulous networks, yielding commendable results. However, we observed that it is a general phenomenon for the feature map intensity to be abruptly suppressed to small values towards the network's end. This implies an information bottleneck and a diminishment of spatial information, implicitly limiting the model's potential. To address this, we propose the Dense-residual-connected Transformer (DRCT), aimed at mitigating the loss of spatial information and stabilizing the information flow through dense-residual connections between layers, thereby unleashing the model's potential and keeping the model away from the information bottleneck. Experiment results indicate that our approach surpasses state-of-the-art methods on benchmark datasets and performs commendably at the NTIRE-2024 Image Super-Resolution (x4) Challenge. Our source code is available at https://github.com/ming053l/DRCT.
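A dense-residual-connected group of the kind described above can be sketched as follows (an illustrative PyTorch block with assumed channel and growth sizes, not the authors' exact DRCT module): every layer receives the concatenation of all previous features, and a 1×1 fusion plus a skip connection keeps the information flow stable.

```python
import torch
import torch.nn as nn

class DenseResidualGroup(nn.Module):
    """Each layer sees the concatenation of all previous features (dense links);
    a 1x1 fusion conv plus a residual skip stabilizes the information flow."""
    def __init__(self, channels=64, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        in_ch = channels
        for _ in range(n_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_ch, growth, 3, padding=1), nn.GELU()))
            in_ch += growth
        self.fuse = nn.Conv2d(in_ch, channels, 1)

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return x + self.fuse(torch.cat(feats, dim=1))

y = DenseResidualGroup()(torch.randn(1, 64, 24, 24))
```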
Aligning generative real-world image super-resolution models with human visual preference is challenging due to the perception-fidelity trade-off and diverse, unknown degradations. Prior approaches rely on offline preference optimization and static metric aggregation, which are often non-interpretable and prone to pseudo-diversity under strong conditioning. We propose OARS, a process-aware online alignment framework built on COMPASS, an MLLM-based reward that evaluates the LR-to-SR transition by jointly modeling fidelity preservation and perceptual gain with an input-quality-adaptive trade-off. To train COMPASS, we curate COMPASS-20K spanning synthetic and real degradations, and introduce a three-stage perceptual annotation pipeline that yields calibrated, fine-grained training labels. Guided by COMPASS, OARS performs progressive online alignment from cold-start flow matching to full-reference and finally reference-free RL via shallow LoRA optimization for on-policy exploration. Extensive experiments and user studies demonstrate consistent perceptual improvements while maintaining fidelity, achieving state-of-the-art performance on Real-ISR benchmarks.
Despite significant progress toward super-resolving more realistic images by deeper convolutional neural networks (CNNs), reconstructing fine and natural textures still remains a challenging problem. Recent works on single image super resolution (SISR) are mostly based on optimizing pixel- and content-wise similarity between recovered and high-resolution (HR) images and do not benefit from the recognizability of semantic classes. In this paper, we introduce a novel approach using categorical information to tackle the SISR problem; we present a decoder architecture able to extract and use semantic information to super-resolve a given image by using multitask learning, simultaneously for image super-resolution and semantic segmentation. To explore categorical information during training, the proposed decoder only employs one shared deep network for two task-specific output layers. At run time, only the layers producing the HR image are used, and no segmentation label is required. Extensive perceptual experiments and a user study on images randomly selected from the COCO-Stuff dataset demonstrate the effectiveness of our proposed method, which outperforms the state-of-the-art methods.
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (i.e., runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. These submissions gauge the state of the art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained models of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
Lightweight and efficiency are critical drivers for the practical application of image super-resolution (SR) algorithms. We propose a simple and effective approach, ShuffleMixer, for lightweight image super-resolution that explores large convolution and channel split-shuffle operation. In contrast to previous SR models that simply stack multiple small kernel convolutions or complex operators to learn representations, we explore a large kernel ConvNet for mobile-friendly SR design. Specifically, we develop a large depth-wise convolution and two projection layers based on channel splitting and shuffling as the basic component to mix features efficiently. Since the contexts of natural images are strongly locally correlated, using large depth-wise convolutions only is insufficient to reconstruct fine details. To overcome this problem while maintaining the efficiency of the proposed module, we introduce Fused-MBConvs into the proposed network to model the local connectivity of different features. Experimental results demonstrate that the proposed ShuffleMixer is about 6x smaller than the state-of-the-art methods in terms of model parameters and FLOPs while achieving competitive performance. In NTIRE 2022, our primary method won the model complexity track of the Efficient Super-Resolution Challenge [23]. The code is available at https://github.com/sunny2109/MobileSR-NTIRE2022.
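The core feature-mixing step in ShuffleMixer, a large depth-wise convolution for spatial mixing plus channel split-and-shuffle with point-wise projections for channel mixing, can be sketched as below. This is illustrative PyTorch with an assumed kernel size and channel split, not the released ShuffleMixer code:

```python
import torch
import torch.nn as nn

def channel_shuffle(x, groups=2):
    """Interleave channel groups so information flows across the split halves."""
    b, c, h, w = x.shape
    return x.view(b, groups, c // groups, h, w).transpose(1, 2).reshape(b, c, h, w)

class ShuffleMixLayer(nn.Module):
    """Large depth-wise conv on one channel split for spatial mixing,
    point-wise projections and channel shuffle for channel mixing."""
    def __init__(self, channels=64, kernel=7):
        super().__init__()
        half = channels // 2
        self.dwconv = nn.Conv2d(half, half, kernel, padding=kernel // 2, groups=half)
        self.proj = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.SiLU(),
                                  nn.Conv2d(channels, channels, 1))

    def forward(self, x):
        a, b = x.chunk(2, dim=1)
        x = torch.cat([self.dwconv(a), b], dim=1)
        return channel_shuffle(self.proj(x) + x)

y = ShuffleMixLayer()(torch.randn(1, 64, 24, 24))
```

Applying the large kernel depth-wise and only to half of the channels keeps the parameter and FLOP cost low, which is the efficiency argument made in the abstract.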
Image super-resolution (SR) serves as a fundamental tool for the processing and transmission of multimedia data. Recently, Transformer-based models have achieved competitive performance in image SR. They divide images into fixed-size patches and apply self-attention on these patches to model long-range dependencies among pixels. However, this architectural design originated in high-level vision tasks and lacks design guidance from SR-specific knowledge. In this paper, we aim to design a new attention block whose insights come from the interpretation of the Local Attribution Map (LAM) for SR networks. Specifically, LAM presents a hierarchical importance map where the most important pixels are located in a fine area of a patch and some less important pixels are spread in a coarse area of the whole image. To access pixels in the coarse area, instead of using a very large patch size, we propose a lightweight Global Pixel Access (GPA) module that applies cross-attention with the most similar patch in an image. In the fine area, we use an Intra-Patch Self-Attention (IPSA) module to model long-range pixel dependencies in a local patch, and then a 3×3 convolution is applied to process the finest details. In addition, a Cascaded Patch Division (CPD) strategy is proposed to enhance the perceptual quality of recovered images. Extensive experiments suggest that our method outperforms state-of-the-art lightweight SR methods by a large margin. Code is available at https://github.com/passerer/HPINet.
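The Global Pixel Access idea, attending to the single most similar patch rather than enlarging the window, can be illustrated with the sketch below. The function name, the cosine-similarity matching on mean-pooled descriptors, and the plain cross-attention are all assumptions made for illustration, not the HPINet implementation:

```python
import torch
import torch.nn.functional as F

def top1_patch_cross_attention(patches: torch.Tensor) -> torch.Tensor:
    """For each patch, find its most similar other patch (cosine similarity of
    mean-pooled features) and attend to that patch's pixels (illustrative).
    patches: (N, C, P, P) -> tensor of the same shape."""
    n, c, p, _ = patches.shape
    desc = F.normalize(patches.mean(dim=(2, 3)), dim=1)   # (N, C) patch descriptors
    sim = desc @ desc.t()
    sim.fill_diagonal_(float("-inf"))                      # exclude self-matches
    partner = sim.argmax(dim=1)                            # index of most similar patch

    q = patches.flatten(2).transpose(1, 2)                 # (N, P*P, C) queries
    kv = patches[partner].flatten(2).transpose(1, 2)       # (N, P*P, C) keys/values
    attn = torch.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)
    out = attn @ kv                                        # (N, P*P, C)
    return out.transpose(1, 2).reshape(n, c, p, p)

y = top1_patch_cross_attention(torch.randn(8, 16, 8, 8))
```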
The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying the VSR model to real-world video with unknown and complex degradation remains a challenging task. First, existing degradation models in most VSR methods cannot effectively simulate real-world noise and blur. Instead, simple combinations of classical degradations are used for real-world noise modeling, which often leaves the VSR model vulnerable to out-of-distribution noise. Second, many SR models focus on noise simulation and transfer. Nevertheless, the sampled noise is monotonous and limited. To address the aforementioned problems, we propose a negatives augmentation strategy for generalized noise modeling in the Video Super-Resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degradation domain is widely expanded by negative augmentation to build up various yet challenging real-world noise sets. We further propose the augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality. Project page is available at: https://negvsr.github.io/.
Wide-baseline panoramic images are frequently used in applications like VR and simulations to minimize capturing labor costs and storage needs. However, synthesizing novel views from these panoramic images in real time remains a significant challenge, especially due to panoramic imagery's high resolution and inherent distortions. Although existing 3D Gaussian splatting (3DGS) methods can produce photo-realistic views under narrow baselines, they often overfit the training views when dealing with wide-baseline panoramic images due to the difficulty in learning precise geometry from sparse 360° views. This paper presents Splatter-360, a novel end-to-end generalizable 3DGS framework designed to handle wide-baseline panoramic images. Unlike previous approaches, Splatter-360 performs multi-view matching directly in the spherical domain by constructing a spherical cost volume through a spherical sweep algorithm, enhancing the network's depth perception and geometry estimation. Additionally, we introduce a 3D-aware bi-projection encoder to mitigate the distortions inherent in panoramic images and integrate cross-view attention to improve feature interactions across multiple viewpoints. This enables robust 3D-aware feature representations and real-time rendering capabilities. Experimental results on the HM3D and Replica datasets demonstrate that Splatter-360 significantly outperforms state-of-the-art NeRF and 3DGS methods (e.g., PanoGRF, MVSplat, DepthSplat, and HiSplat) in both synthesis quality and generalization performance for wide-baseline panoramic images. Code and trained models are available at https://3d-aigc.github.io/Splatter-360/.
Despite the advances of deep learning in specific tasks using images, the principled assessment of image fidelity and similarity is still a critical ability to develop. As it has been shown that Mean Squared Error (MSE) is insufficient for this task, other measures have been developed with one of the most effective being Structural Similarity Index (SSIM). Such measures can be used for subspace learning but existing methods in machine learning, such as Principal Component Analysis (PCA), are based on Euclidean distance or MSE and thus cannot properly capture the structural features of images. In this paper, we define an image structure subspace which discriminates different types of image distortions. We propose Image Structural Component Analysis (ISCA) and also kernel ISCA by using SSIM, rather than Euclidean distance, in the formulation of PCA. This paper provides a bridge between image quality assessment and manifold learning opening a broad new area for future research.
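For reference, the structural similarity index that this formulation substitutes for Euclidean distance is, for two image patches $x$ and $y$,

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$

where $\mu_x$, $\mu_y$ are local means, $\sigma_x^2$, $\sigma_y^2$ are local variances, $\sigma_{xy}$ is the local covariance, and $c_1$, $c_2$ are small stabilizing constants. Replacing MSE with this structural measure is what distinguishes ISCA from standard PCA.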
We address the problem of active visual exploration of large 360° inputs. In our setting an active agent with a limited camera bandwidth explores its 360° environment by changing its viewing direction at limited discrete time steps. As such, it observes the world as a sequence of narrow field-of-view 'glimpses', deciding for itself where to look next. Our proposed method exceeds previous works' performance by a significant margin without the need for deep reinforcement learning or training separate networks as sidekicks. A key component of our system is the set of spatial memory maps that make the system aware of the glimpses' orientations (locations in the 360° image). Further, we stress the advantages of retina-like glimpses when the agent's sensor bandwidth and time-steps are limited. Finally, we use our trained model to classify the whole scene using only the information observed in the glimpses.
Image restoration networks are usually comprised of an encoder and a decoder, responsible for aggregating image content from noisy, distorted data and for restoring clean, undistorted images, respectively. Data aggregation as well as high-resolution image generation both usually come at the risk of involving aliases, i.e., standard architectures put their ability to reconstruct the model input in jeopardy to reach high PSNR values on validation data. The price to be paid is low model robustness. In this work, we show that simply providing alias-free paths in state-of-the-art reconstruction transformers supports improved model robustness at low cost to restoration performance. We do so by proposing BOA-Restormer, a transformer-based image restoration model that executes downsampling and upsampling operations partly in the frequency domain to ensure alias-free paths along the entire model while potentially preserving all relevant high-frequency information.
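One simple way to obtain an alias-free downsampling path is to perform the resolution change in the frequency domain, i.e., crop the centered spectrum (an ideal low-pass filter) before returning to the spatial domain. The sketch below is a generic illustration of this principle, not the BOA-Restormer implementation:

```python
import torch

def fft_downsample(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    """Alias-free downsampling by cropping the centered Fourier spectrum,
    i.e., ideal low-pass filtering before the resolution change (illustrative)."""
    b, c, h, w = x.shape
    X = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
    nh, nw = h // factor, w // factor
    top, left = (h - nh) // 2, (w - nw) // 2
    X = X[..., top:top + nh, left:left + nw]          # keep only low frequencies
    y = torch.fft.ifft2(torch.fft.ifftshift(X, dim=(-2, -1)), norm="ortho").real
    return y / factor  # rescale so the mean intensity stays comparable

y = fft_downsample(torch.rand(1, 3, 64, 64))  # -> (1, 3, 32, 32)
```

Because all frequencies above the new Nyquist limit are removed before subsampling, no spectral content folds back into the signal, which is exactly the aliasing that plain strided operations risk introducing.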
Many real-world solutions for image restoration are learning-free and based on handcrafted image priors such as self-similarity. Recently, deep-learning methods that use training data have achieved state-of-the-art results in various image restoration tasks (e.g., super-resolution and inpainting). Ulyanov et al. bridge the gap between these two families of methods (CVPR 18). They have shown that learning-free methods perform close to the state-of-the-art learning-based methods (approximately 1 PSNR). Their approach benefits from the encoder-decoder network. In this paper, we propose a framework based on the multi-level extensions of the encoder-decoder network, to investigate interesting aspects of the relationship between image restoration and network construction independent of learning. Our framework allows various network structures by modifying the following network components: skip links, cascading of the network input into intermediate layers, a composition of the encoder-decoder subnetworks, and network depth. These handcrafted network structures illustrate how the construction of untrained networks influence the following image restoration tasks: denoising, super-resolution, and inpainting. We also demonstrate image reconstruction using flash and no-flash image pairs. We provide performance comparisons with the state-of-the-art methods for all the restoration tasks above.
Following their success in visual recognition tasks, Vision Transformers (ViTs) are being increasingly employed for image restoration. As a few recent works claim that ViTs for image classification also have better robustness properties, we investigate whether the improved adversarial robustness of ViTs extends to image restoration. We consider the recently proposed Restormer model, as well as NAFNet and the "Baseline network", which are both simplified versions of a Restormer. We use Projected Gradient Descent (PGD) and CosPGD, a recently proposed adversarial attack tailored to pixel-wise prediction tasks, for our robustness evaluation. Our experiments are performed on real-world images from the GoPro dataset for image deblurring. Our analysis indicates that, contrary to what is advocated in ViT image classification works, these models are highly susceptible to adversarial attacks. We attempt to improve their robustness through adversarial training. While this yields a significant increase in robustness for Restormer, results on other networks are less promising. Interestingly, the design choices in NAFNet and the Baselines, which were based on i.i.d. performance rather than robust generalization, seem to be at odds with model robustness. Thus, we investigate this further and find a fix.
Learning a good image prior is a long-term goal for image restoration and manipulation. While existing methods like deep image prior (DIP) capture low-level image statistics, there are still gaps toward an image prior that captures rich image semantics including color, spatial coherence, textures, and high-level concepts. This work presents an effective way to exploit the image prior captured by a generative adversarial network (GAN) trained on large-scale natural images. As shown in Fig. 1, the deep generative prior (DGP) provides compelling results to restore missing semantics, e.g., color, patch, resolution, of various degraded images. It also enables diverse image manipulation including random jittering, image morphing, and category transfer. Such highly flexible restoration and manipulation are made possible through relaxing the assumption of existing GAN-inversion methods, which tend to fix the generator. Notably, we allow the generator to be fine-tuned on-the-fly in a progressive manner regularized by the feature distance obtained from the discriminator in the GAN. We show that these easy-to-implement and practical changes help keep the reconstruction in the manifold of natural images, and thus lead to more precise and faithful reconstruction for real images. Code is available at https://github.com/XingangPan/deep-generative-prior.
We introduce the first learning-based dense matching algorithm, termed Equirectangular Projection-Oriented Dense Kernelized Feature Matching (EDM), specifically designed for omnidirectional images. Equirectangular projection (ERP) images, with their large fields of view, are particularly suited for dense matching techniques that aim to establish comprehensive correspondences across images. However, ERP images are subject to significant distortions, which we address by leveraging the spherical camera model and geodesic flow refinement in the dense matching method. To further mitigate these distortions, we propose spherical positional embeddings based on 3D Cartesian coordinates of the feature grid. Additionally, our method incorporates bidirectional transformations between spherical and Cartesian coordinate systems during refinement, utilizing a unit sphere to improve matching performance. We demonstrate that our proposed method achieves notable performance enhancements, with improvements of +26.72 and +42.62 in AUC@5° on the Matterport3D and Stanford2D3D datasets.
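The 3D Cartesian coordinates used for such spherical positional embeddings can be obtained by mapping each ERP pixel center onto the unit sphere. A minimal NumPy sketch under one common ERP convention (longitude spanning the image width, latitude spanning the image height); the exact convention in EDM may differ:

```python
import numpy as np

def erp_to_unit_sphere(h: int, w: int) -> np.ndarray:
    """Map every ERP pixel center to 3D Cartesian coordinates on the unit sphere."""
    jj, ii = np.meshgrid(np.arange(w), np.arange(h))        # column and row indices
    lon = (jj + 0.5) / w * 2.0 * np.pi - np.pi               # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (ii + 0.5) / h * np.pi               # latitude, +pi/2 at top row
    xyz = np.stack([np.cos(lat) * np.cos(lon),
                    np.cos(lat) * np.sin(lon),
                    np.sin(lat)], axis=-1)                   # shape (h, w, 3)
    return xyz

coords = erp_to_unit_sphere(512, 1024)
```

Feeding these (x, y, z) coordinates to the positional-embedding branch gives the network a distortion-free notion of where each ERP pixel actually lies on the sphere.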
Disconnectivity and distortion are the two problems which must be coped with when processing 360-degree equirectangular images. In this paper, we propose a method of estimating the depth of a monocular panoramic image with a teacher-student model fusing equirectangular and spherical representations. In contrast with the existing methods fusing an equirectangular representation with a cube map or tangent representation, a spherical representation is a better choice because sampling on a sphere is more uniform and can also cope with distortion more effectively. In this process, a novel spherical convolution kernel that computes with sampling points on a sphere is developed to extract features from the spherical representation, and then a Segmentation Feature Fusion (SFF) methodology is utilized to combine these features with ones extracted from the equirectangular representation. In contrast with the existing methods using a teacher-student model to obtain a lighter depth estimation model, we use a teacher-student model to learn the latent features of depth images. This results in a trained model which estimates the depth map of an equirectangular image using not only the feature maps extracted from an input equirectangular image but also the distilled knowledge learnt from the ground-truth depth maps of a training set. In experiments, the proposed method is tested on several well-known 360 monocular depth estimation benchmark datasets, and outperforms the existing methods on most evaluation metrics.
Estimating the depth of equirectangular (i.e., 360°) images (EIs) is challenging given the distorted 180° × 360° field of view, which is hard to address with convolutional neural networks (CNNs). Although a transformer with global attention achieves significant improvements over CNNs for the EI depth estimation task, it is computationally inefficient, which raises the need for a transformer with local attention. However, to apply local attention successfully to EIs, a specific strategy that addresses the distorted equirectangular geometry and the limited receptive field simultaneously is required. Prior works have addressed only one of the two, occasionally resulting in unsatisfactory depths. In this paper, we propose an equirectangular geometry-biased transformer termed EGformer. While limiting the computational cost and the number of network parameters, EGformer enables the extraction of equirectangular geometry-aware local attention with a large receptive field. To achieve this, we actively utilize the equirectangular geometry as the bias for the local attention instead of struggling to reduce the distortion of EIs. As compared to the most recent EI depth estimation studies, the proposed approach yields the best depth outcomes overall with the lowest computational cost and the fewest parameters, demonstrating the effectiveness of the proposed methods.
This study proposes a practical approach for compressing 360-degree equirectangular videos using pretrained neural video compression (NVC) models. Without requiring additional training or changes in the model architectures, the proposed method extends quantization parameter adaptation techniques from traditional video codecs to NVC, utilizing the spatially varying sampling density in equirectangular projections. We introduce latitude-based adaptive quality parameters through rate-distortion optimization for NVC. The proposed method utilizes vector bank interpolation for latent modulation, enabling flexible adaptation with arbitrary quality parameters and mitigating the limitations caused by rounding errors in the adaptive quantization parameters. Experimental results demonstrate that applying this method to the DCVC-RT framework yields BD-Rate savings of 5.2% in terms of the weighted spherical peak signal-to-noise ratio for JVET class S1 test sequences, with only a 0.3% increase in processing time.
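The weighted spherical PSNR referenced above down-weights pixels near the poles according to the ERP sampling density; the commonly used latitude weight is cos((i + 0.5 - H/2)·π/H) for row i of an H-row frame. A minimal single-channel sketch (the latitude-adaptive quantization itself is not reproduced here):

```python
import numpy as np

def wspsnr_weights(h: int, w: int) -> np.ndarray:
    """Latitude weights used by WS-PSNR for equirectangular frames:
    w(i) = cos((i + 0.5 - H/2) * pi / H), replicated across columns."""
    rows = np.arange(h)
    col_w = np.cos((rows + 0.5 - h / 2.0) * np.pi / h)
    return np.tile(col_w[:, None], (1, w))

def wspsnr(ref: np.ndarray, dist: np.ndarray, peak: float = 255.0) -> float:
    """Weighted-MSE PSNR over a single-channel ERP frame (illustrative)."""
    w = wspsnr_weights(*ref.shape)
    wmse = np.sum(w * (ref.astype(np.float64) - dist) ** 2) / np.sum(w)
    return 10.0 * np.log10(peak ** 2 / wmse)

score = wspsnr(np.full((180, 360), 128.0), np.full((180, 360), 130.0))
```

The same cosine profile is what motivates spending fewer bits (coarser quantization) near the poles: those rows contribute little to the spherical quality measure.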
Signal degradation is ubiquitous and computational restoration of degraded signals has been investigated for many years. Recently, it has been reported that the capability of signal restoration is fundamentally limited by the perception-distortion tradeoff, i.e. the distortion and the perceptual difference between the restored signal and the ideal 'original' signal cannot both be made minimal simultaneously. Distortion corresponds to signal fidelity and perceptual difference corresponds to perceptual naturalness, both of which are important metrics in practice. Besides, there is another dimension worthy of consideration, namely the semantic quality, or the utility for recognition purposes, of the restored signal. In this paper, we extend the previous perception-distortion tradeoff to the case of the classification-distortion-perception (CDP) tradeoff, where we introduce the classification error rate of the restored signal in addition to distortion and perceptual difference. Two versions of the CDP tradeoff are considered, one using a predefined classifier and the other dealing with the optimal classifier for the restored signal. For both versions, we can rigorously prove the existence of the CDP tradeoff, i.e. the distortion, perceptual difference, and classification error rate cannot all be made minimal simultaneously. Our findings can be useful especially for computer vision research where some low-level vision tasks (signal restoration) serve high-level vision tasks (visual understanding).
The complex traffic environment and various weather conditions make the collection of LiDAR data expensive and challenging. Achieving high-quality and controllable LiDAR data generation is urgently needed; controlling generation with text is a common practice, but there is little research in this field. To this end, we propose Text2LiDAR, the first efficient, diverse, and text-controllable LiDAR data generation model. Specifically, we design an equirectangular transformer architecture, utilizing the designed equirectangular attention to capture LiDAR features in a manner suited to the data characteristics. Then, we design a control-signal embedding injector to efficiently integrate control signals through the global-to-focused attention mechanism. Additionally, we devise a frequency modulator to assist the model in recovering high-frequency details, ensuring the clarity of the generated point cloud. To foster development in the field and optimize text-controlled generation performance, we construct nuLiDARtext, which offers diverse text descriptors for 34,149 LiDAR point clouds from 850 scenes. Experiments on uncontrolled and text-controlled generation in various forms on the KITTI-360 and nuScenes datasets demonstrate the superiority of our approach.
A novel square equal-area map projection is proposed. The projection combines closed-form forward and inverse solutions with relatively low angular distortion and minimal cusps, a combination of properties not manifested by any previously published square equal-area projection. Thus, the new projection has lower angular distortion than any previously published square equal-area projection with a closed-form solution. Utilizing a quincuncial arrangement, the new projection places the north pole at the center of the square and divides the south pole between its four corners; the projection can be seamlessly tiled. The existence of closed-form solutions makes the projection suitable for real-time visualization applications, both in cartography and in other areas, such as for the display of panoramic images.
Deep convolutional neural network based image super-resolution (SR) models have shown superior performance in recovering the underlying high-resolution (HR) images from low-resolution (LR) images obtained from predefined downscaling methods. In this paper we propose a learned image downscaling method based on a content adaptive resampler (CAR) with consideration of the upscaling process. The proposed resampler network generates content adaptive image resampling kernels that are applied to the original HR input to generate pixels on the downscaled image. Moreover, a differentiable upscaling (SR) module is employed to upscale the LR result into its underlying HR counterpart. By back-propagating the reconstruction error down to the original HR input across the entire framework to adjust model parameters, the proposed framework achieves a new state-of-the-art SR performance through upscaling-guided image resamplers which adaptively preserve detailed information that is essential to the upscaling. Experimental results indicate that the quality of the generated LR image is comparable to that of traditional interpolation-based methods, but a significant SR performance gain is achieved by deep SR models trained jointly with the CAR model. The code is publicly available at https://github.com/sunwj/CAR.
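The joint training idea, backpropagating the SR reconstruction error through both the learned downscaler and the SR network, can be sketched as a toy training step. The two networks below are hypothetical stand-ins (a plain strided-conv downscaler and a PixelShuffle upscaler), not the CAR kernel-prediction resampler itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins: a learnable downscaler and an SR network.
downscaler = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(32, 3, 3, stride=4, padding=1))            # HR -> LR (x4 down)
sr_net = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                       nn.Conv2d(32, 3 * 16, 3, padding=1), nn.PixelShuffle(4))  # LR -> HR

opt = torch.optim.Adam(list(downscaler.parameters()) + list(sr_net.parameters()), lr=1e-4)

hr = torch.rand(2, 3, 64, 64)        # a toy HR batch
lr = downscaler(hr)                  # learned downscaling (simplified)
sr = sr_net(lr)                      # differentiable upscaling back to HR size
loss = F.l1_loss(sr, hr)             # reconstruction error drives both networks
opt.zero_grad(); loss.backward(); opt.step()
```

Because the reconstruction loss flows through the downscaler, it learns to keep exactly the detail the upscaler needs, which is the central point of the abstract.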
With advances in artificial intelligence, image processing has gained significant interest. Image super-resolution is a vital technology closely related to real-world applications, as it enhances the quality of existing images. Since enhancing fine details is crucial for the super-resolution task, pixels that contribute to high-frequency information should be emphasized. This paper proposes two methods to enhance high-frequency details in super-resolution images: a Laplacian pyramid-based detail loss and a repeated upscaling and downscaling process. The total loss, combined with our detail loss, guides a model by separately generating and controlling super-resolution and detail images. This approach allows the model to focus more effectively on high-frequency components, resulting in improved super-resolution images. Additionally, repeated upscaling and downscaling amplify the effectiveness of the detail loss by extracting diverse information from multiple low-resolution features. We conduct two types of experiments. First, we design a CNN-based model incorporating our methods. This model achieves state-of-the-art results, surpassing all currently available CNN-based and even some attention-based models. Second, we apply our methods to existing attention-based models on a small scale. In all our experiments, attention-based models adding our detail loss show improvements compared to the originals. These results demonstrate that our approaches effectively enhance super-resolution images across different model structures.
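A Laplacian-pyramid-style detail loss of the kind described above can be sketched as follows. This is an illustrative single-level version with bilinear down/upsampling; the paper's exact pyramid construction may differ:

```python
import torch
import torch.nn.functional as F

def laplacian_detail(img: torch.Tensor) -> torch.Tensor:
    """High-frequency residual of the first Laplacian-pyramid level:
    the image minus its blurred (down-then-up-sampled) version."""
    low = F.interpolate(img, scale_factor=0.5, mode="bilinear", align_corners=False)
    blur = F.interpolate(low, size=img.shape[-2:], mode="bilinear", align_corners=False)
    return img - blur

def detail_loss(sr: torch.Tensor, hr: torch.Tensor) -> torch.Tensor:
    """L1 distance between the high-frequency details of SR and HR outputs."""
    return F.l1_loss(laplacian_detail(sr), laplacian_detail(hr))

loss = detail_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```

Adding this term to a standard pixel loss penalizes errors specifically in the high-frequency band, which is why the models in the abstract recover sharper textures.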
In this paper, we tackle the challenging task of Panoramic Image-to-Image translation (Pano-I2I) for the first time. This task is difficult due to the geometric distortion of panoramic images and the lack of a panoramic image dataset with diverse conditions, like weather or time. To address these challenges, we propose a panoramic distortion-aware I2I model that preserves the structure of the panoramic images while consistently translating their global style referenced from a pinhole image. To mitigate the distortion issue in naive 360 panorama translation, we adopt spherical positional embedding in our transformer encoders, introduce a distortion-free discriminator, and apply sphere-based rotation for augmentation and its ensemble. We also design a content encoder and a style encoder to be deformation-aware to deal with a large domain gap between panoramas and pinhole images, enabling us to work on diverse conditions of pinhole images. In addition, considering the large discrepancy between panoramas and pinhole images, our framework decouples the learning procedure of the panoramic reconstruction stage from the translation stage. We show distinct improvements over existing I2I models in translating the StreetLearn dataset in the daytime into diverse conditions. The code will be publicly available online for our community.
360° imaging has recently gained great attention; however, its angular resolution is relatively lower than that of a narrow field-of-view (FOV) perspective image, as it is captured using fisheye lenses with the same sensor size. Therefore, it is beneficial to super-resolve a 360° image. Some attempts have been made, but they mostly consider the equirectangular projection (ERP) as the representation for 360° images despite its latitude-dependent distortions. In that case, as the output high-resolution (HR) image is always in the same ERP format as the low-resolution (LR) input, additional information loss may occur when transforming the HR image to other projection types. In this paper, we propose SphereSR, a novel framework to generate a continuous spherical image representation from an LR 360° image, aiming at predicting the RGB values at given spherical coordinates for super-resolution with an arbitrary 360° image projection. Specifically, we first propose a feature extraction module that represents the spherical data based on an icosahedron and efficiently extracts features on the spherical surface. We then propose a spherical local implicit image function (SLIIF) to predict RGB values at the spherical coordinates. As such, SphereSR flexibly reconstructs an HR image under an arbitrary projection type. Experiments on various benchmark datasets show that our method significantly surpasses existing methods.
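A spherical implicit image function can be pictured as an MLP queried with features sampled at the corresponding ERP location plus the spherical coordinates themselves. The sketch below is a simplified LIIF-style illustration on an ERP feature map, with assumed shapes and a made-up class name, not the SLIIF module itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SphericalImplicitDecoder(nn.Module):
    """Illustrative implicit decoder: RGB at arbitrary spherical coordinates,
    predicted from LR ERP features sampled at the corresponding location."""
    def __init__(self, feat_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(feat_dim + 2, 256), nn.ReLU(),
                                 nn.Linear(256, 3))

    def forward(self, feat, lon, lat):
        # feat: (1, C, H, W) ERP feature map; lon in [-pi, pi], lat in [-pi/2, pi/2]
        grid_x = lon / torch.pi                      # normalize longitude to [-1, 1]
        grid_y = -lat / (torch.pi / 2)               # top of the image is lat = +pi/2
        grid = torch.stack([grid_x, grid_y], dim=-1).view(1, -1, 1, 2)
        f = F.grid_sample(feat, grid, align_corners=False).squeeze(-1).squeeze(0).t()  # (N, C)
        coords = torch.stack([lon, lat], dim=-1)     # (N, 2)
        return self.mlp(torch.cat([f, coords], dim=-1))  # (N, 3) RGB values

dec = SphericalImplicitDecoder()
rgb = dec(torch.randn(1, 64, 32, 64), torch.zeros(10), torch.zeros(10))
```

Because the query is expressed in spherical coordinates rather than ERP pixel indices, the same decoder can be evaluated on the sampling grid of any target projection.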
The field of panoramic image enhancement and super-resolution has developed into a multi-level research system, spanning general-purpose super-resolution at the bottom, geometry-aware algorithms in the middle, and streaming-system optimization at the top. The current technical frontier shows the following characteristics: first, a shift from plain pixel upscaling to deep modeling of geometric properties (such as latitude awareness and spherical implicit representations); second, deep integration with generative AI (diffusion models, Mamba) to achieve content completion and extension; third, system-level real-time performance and energy-efficiency optimization becoming key to the commercial deployment of VR/AR; and fourth, the rapid spread of panoramic techniques into vertical domains such as underwater, aerial, and autonomous-driving applications.