Video Enhancement and Video Super-Resolution
Classical Video Super-Resolution Architectures: Propagation, Alignment, and Recurrence
This group of papers focuses on backbone network design for video super-resolution (VSR), examining how deformable convolution (DCN), implicit alignment, bidirectional recurrent networks (RNNs), and improved U-Net structures achieve efficient spatio-temporal feature aggregation. A minimal code sketch of bidirectional recurrent propagation follows the reference list below.
- BasicVSR: The Search for Essential Components in Video Super-Resolution and Beyond(Kelvin C. K. Chan, Xintao Wang, Ke Yu, Chao Dong, Chen Change Loy, 2020, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- EDVR: Video Restoration with Enhanced Deformable Convolutional Networks(Xintao Wang, Kelvin C. K. Chan, Ke Yu, Chao Dong, Chen Change Loy, 2019, ArXiv Preprint)
- A Simple Baseline for Video Restoration with Grouped Spatial-temporal Shift(Dasong Li, Xiaoyu Shi, Yi Zhang, Ka Chun Cheung, Simon See, Xiaogang Wang, Hongwei Qin, Hongsheng Li, 2022, ArXiv Preprint)
- BasicVSR++: Improving Video Super-Resolution with Enhanced Propagation and Alignment(Kelvin C. K. Chan, Shangchen Zhou, Xiangyu Xu, Chen Change Loy, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Frame-Recurrent Video Super-Resolution(Mehdi S. M. Sajjadi, Raviteja Vemulapalli, Matthew A. Brown, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition)
- Recurrent Back-Projection Network for Video Super-Resolution(Muhammad Haris, Gregory Shakhnarovich, N. Ukita, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- On the Generalization of BasicVSR++ to Video Deblurring and Denoising(Kelvin C. K. Chan, Shangchen Zhou, Xiangyu Xu, Chen Change Loy, 2022, ArXiv)
- VMG: Rethinking U-Net Architecture for Video Super-Resolution(Jun Tang, Lele Niu, Linlin Liu, Hang Dai, Yong Ding, 2025, IEEE Transactions on Broadcasting)
- Revisiting Temporal Alignment for Video Restoration(Kun Zhou, Wenbo Li, Liying Lu, Xiaoguang Han, Jiangbo Lu, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- TDAN: Temporally-Deformable Alignment Network for Video Super-Resolution(Yapeng Tian, Yulun Zhang, Y. Fu, Chenliang Xu, 2018, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Bidirectional Recurrent Convolutional Networks for Multi-Frame Super-Resolution(Yan Huang, Wei Wang, Liang Wang, 2015, No journal)
- Deep Unrolled Network for Video Super-Resolution(Benjamin Naoto Chiche, J. Frontera-Pons, Arnaud Woiselle, Jean-Luc Starck, 2020, 2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA))
- Detail-Revealing Deep Video Super-Resolution(Xin Tao, Hongyun Gao, Renjie Liao, Jue Wang, Jiaya Jia, 2017, 2017 IEEE International Conference on Computer Vision (ICCV))
- DVSRNet: Deep Video Super-Resolution Based on Progressive Deformable Alignment and Temporal-Sparse Enhancement(Qiang Zhu, Feiyu Chen, Shuyuan Zhu, Yu Liu, Xue Zhou, Ruiqin Xiong, Bing Zeng, 2024, IEEE Transactions on Neural Networks and Learning Systems)
- Deep Video Super-Resolution Network Using Dynamic Upsampling Filters Without Explicit Motion Compensation(Younghyun Jo, Seoung Wug Oh, Jaeyeon Kang, Seon Joo Kim, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition)
- ECL: Exclusive Curriculum Learning for Video Super-Resolution(Sichen Hu, Zhongyuan Wang, Peng Yi, Zheng He, Jinsheng Xiao, Jing Xiao, 2022, 2022 IEEE International Conference on Multimedia and Expo (ICME))
- TAM: Temporal Adaptive Module for Video Recognition(Zhaoyang Liu, Limin Wang, Wayne Wu, Chen Qian, Tong Lu, 2020, ArXiv Preprint)
- Feature Aggregating Network with Inter-Frame Interaction for Efficient Video Super-Resolution(Yawei Li, Zhao Zhang, Suiyi Zhao, Jicong Fan, Haijun Zhang, Mingliang Xu, 2023, 2023 IEEE International Conference on Data Mining (ICDM))
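To make the shared recurrent design concrete, here is a minimal PyTorch sketch of BasicVSR-style bidirectional propagation: a hidden state is carried backward and then forward through the clip, fused per frame, and upsampled with a sub-pixel layer. It omits flow-based alignment and uses toy layer sizes; all module and parameter names are illustrative, not any cited implementation.

```python
# Minimal sketch of bidirectional recurrent propagation for VSR (BasicVSR-style).
# Illustrative only: no optical-flow alignment, tiny layers, made-up sizes.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BidirectionalVSR(nn.Module):
    def __init__(self, channels=32, scale=4):
        super().__init__()
        self.channels, self.scale = channels, scale
        # Each propagation branch consumes [current LR frame, previous hidden state].
        self.backward_prop = nn.Sequential(
            nn.Conv2d(3 + channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        self.forward_prop = nn.Sequential(
            nn.Conv2d(3 + channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1))
        # Fuse both directions and upsample with a sub-pixel (pixel-shuffle) layer.
        self.fuse = nn.Conv2d(2 * channels, channels, 1)
        self.upsample = nn.Sequential(
            nn.Conv2d(channels, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale))

    def forward(self, lr_frames):                  # (B, T, 3, H, W)
        b, t, _, h, w = lr_frames.shape
        # Backward pass: carry a hidden state from the last frame to the first.
        hidden = lr_frames.new_zeros(b, self.channels, h, w)
        backward_feats = [None] * t
        for i in range(t - 1, -1, -1):
            hidden = self.backward_prop(torch.cat([lr_frames[:, i], hidden], dim=1))
            backward_feats[i] = hidden
        # Forward pass: carry a hidden state from the first frame onward, fuse, upsample.
        outputs, hidden = [], torch.zeros_like(hidden)
        for i in range(t):
            hidden = self.forward_prop(torch.cat([lr_frames[:, i], hidden], dim=1))
            feat = self.fuse(torch.cat([hidden, backward_feats[i]], dim=1))
            base = F.interpolate(lr_frames[:, i], scale_factor=self.scale,
                                 mode='bilinear', align_corners=False)
            outputs.append(base + self.upsample(feat))
        return torch.stack(outputs, dim=1)          # (B, T, 3, s*H, s*W)

video = torch.rand(1, 5, 3, 64, 64)                 # 5-frame 64x64 clip
print(BidirectionalVSR()(video).shape)               # torch.Size([1, 5, 3, 256, 256])
```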
Frontier Long-Range Modeling: Transformer and Mamba Architecture Optimization
These works use the self-attention of Transformers and the linear-complexity sequence modeling of Mamba to overcome the limitations of conventional CNNs in aligning large motions and capturing long-range spatio-temporal dependencies, improving global consistency. A minimal sketch of spatio-temporal self-attention follows the reference list below.
- VRT: A Video Restoration Transformer(Jingyun Liang, Jiezhang Cao, Yuchen Fan, Kai Zhang, Rakesh Ranjan, Yawei Li, Radu Timofte, Luc Van Gool, 2022, ArXiv Preprint)
- Video Super-Resolution Transformer(Jie Cao, Yawei Li, K. Zhang, L. Gool, 2021, ArXiv)
- Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention(Xingyu Zhou, Leheng Zhang, Xiaorui Zhao, Keze Wang, Leida Li, Shuhang Gu, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- CTVSR: Collaborative Spatial–Temporal Transformer for Video Super-Resolution(Jun Tang, Chenyang Lu, Zhengxue Liu, Jiale Li, Hang Dai, Yong Ding, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Multi-Scale Video Super-Resolution Transformer With Polynomial Approximation(Fan Zhang, Gongguan Chen, Hua Wang, Jinjiang Li, Caiming Zhang, 2023, IEEE Transactions on Circuits and Systems for Video Technology)
- HAMSA: Hybrid attention transformer and multi-scale alignment aggregation network for video super-resolution(Hanguang Xiao, Hao Wen, Xin Wang, Kun Zuo, Tianqi Liu, Wei Wang, Yong Xu, 2025, Digit. Signal Process.)
- Combining optical flow and Swin Transformer for Space-Time video super-resolution(Xin Wang, Hua Wang, Mingling Zhang, Fan Zhang, 2024, Eng. Appl. Artif. Intell.)
- Temporal Transformer-Based Video Super-Resolution Reconstruction with Cross-Modal Attention(Jingming Gong, Qinfei Xu, 2025, Informatica (Slovenia))
- BSRT: Improving Burst Super-Resolution with Swin Transformer and Flow-Guided Deformable Alignment(Ziwei Luo, Youwei Li, Shen Cheng, Lei Yu, Qi Wu, Zhihong Wen, Haoqiang Fan, Jian Sun, Shuaicheng Liu, 2022, ArXiv Preprint)
- GridFormer-VSR: A Multi-Attention Vision Transformer for Video Super-Resolution(Anas M. Ali, W. El-shafai, El-Sayed M. El-Rabaie, F. A. Abd El‑Samie, K. Ramadan, 2025, 2025 4th International Conference on Electronic Engineering (ICEEM))
- Flow-Guided Sparse Transformer for Video Deblurring(Jing Lin, Yuanhao Cai, Xiaowan Hu, Haoqian Wang, Youliang Yan, Xueyi Zou, Henghui Ding, Yulun Zhang, Radu Timofte, Luc Van Gool, 2022, ArXiv Preprint)
- Recurrent Transformer Based Framework for Video Denoising and Super Resolution Using Optical Flow and Temporal Attention(Pitty Nagarjuna, B. Harsha, 2026, ICTACT Journal on Image and Video Processing)
- Transformer Channel Attention Network for Video Super-Resolution(Yooho Lee, Dongsan Jun, 2025, Journal of Korea Multimedia Society)
- DualX-VSR: Dual Axial Spatial⨉Temporal Transformer for Real-World Video Super-Resolution without Motion Compensation(Shuo Cao, Yihao Liu, Xiaohui Li, Yuanting Gao, Yu Zhou, Chao Dong, 2025, ArXiv)
- Rethinking Alignment in Video Super-Resolution Transformers(Shu Shi, Jinjin Gu, Liangbin Xie, Xintao Wang, Yujiu Yang, Chao Dong, 2022, ArXiv)
- VSRM: A Robust Mamba-Based Framework for Video Super-Resolution(D. Tran, Dao Duy Hung, Daeyoung Kim, 2025, ArXiv)
- VDTR: Video Deblurring with Transformer(Mingdeng Cao, Yanbo Fan, Yong Zhang, Jue Wang, Yujiu Yang, 2022, ArXiv Preprint)
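As a concrete illustration of the long-range modeling these Transformer-based methods rely on, the following sketch applies joint spatio-temporal self-attention over a small stack of frame features using PyTorch's built-in multi-head attention. It deliberately ignores the windowing, masking, and flow guidance that make the cited models tractable; sizes and names are illustrative assumptions.

```python
# Minimal sketch: joint spatio-temporal self-attention over video features.
# Every (frame, pixel) token attends to every other one -- O((T*H*W)^2) cost,
# which is exactly what windowed/sparse designs in the papers above avoid.
import torch
import torch.nn as nn

class SpatioTemporalAttention(nn.Module):
    def __init__(self, channels=32, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, feats):                     # (B, T, C, H, W)
        b, t, c, h, w = feats.shape
        tokens = feats.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)
        x = self.norm(tokens)
        out, _ = self.attn(x, x, x)               # global attention across time and space
        out = tokens + out                         # residual connection
        return out.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)

feats = torch.rand(1, 4, 32, 16, 16)              # tiny toy size keeps attention affordable
print(SpatioTemporalAttention()(feats).shape)     # torch.Size([1, 4, 32, 16, 16])
```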
Generative Video Enhancement: High-Fidelity Reconstruction with Diffusion Models and GANs
Covers recent research that applies generative diffusion models and GANs to video restoration, focusing on texture generation, one-step sampling efficiency, and the suppression of spatio-temporal flicker so that generated frames remain consistent. A schematic sketch of one-step diffusion restoration follows the reference list below.
- DOVE: Efficient One-Step Diffusion Model for Real-World Video Super-Resolution(Zheng Chen, Zichen Zou, Kewei Zhang, Xiongfei Su, Xin Yuan, Yong Guo, Yulun Zhang, 2025, ArXiv)
- STCDiT: Spatio-Temporally Consistent Diffusion Transformer for High-Quality Video Super-Resolution(Junyang Chen, Jiangxin Dong, Long Sun, Yixin Yang, Jinshan Pan, 2025, ArXiv)
- FlashVSR: Towards Real-Time Diffusion-Based Streaming Video Super-Resolution(Junhao Zhuang, Shi Guo, Xin Cai, Xiaohui Li, Yihao Liu, Chun Yuan, Tianfan Xue, 2025, ArXiv)
- OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion(Shuoyan Wei, Feng Li, Chen Zhou, Runmin Cong, Yao Zhao, Huihui Bai, 2026, ArXiv)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution(Shangchen Zhou, Peiqing Yang, Jianyi Wang, Yihang Luo, Chen Change Loy, 2023, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- TDM: Temporally-Consistent Diffusion Model for All-in-One Real-World Video Restoration(Yizhou Li, Zihua Liu, Yusuke Monno, Masatoshi Okutomi, 2025, ArXiv)
- VideoGigaGAN: Towards Detail-rich Video Super-Resolution(Yiran Xu, Taesung Park, Richard Zhang, Yang Zhou, Eli Shechtman, Feng Liu, Jia-Bin Huang, Difan Liu, 2024, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation(Xin Li, Wenqing Chu, Ye Wu, Weihang Yuan, Fanglong Liu, Qi Zhang, Fu Li, Haocheng Feng, Errui Ding, Jingdong Wang, 2023, ArXiv)
- UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space(Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, Fei Wang, 2025, Proceedings of the 33rd ACM International Conference on Multimedia)
- Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment(Zhihao Zhan, Wang Pang, Xiang Zhu, Yechao Bai, 2025, 2025 17th International Conference on Signal Processing Systems (ICSPS))
- STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution(Rui Xie, Yinhong Liu, Penghao Zhou, Chen Zhao, Jun Zhou, Kai Zhang, Zhenyu Zhang, Jian Yang, Zhenheng Yang, Ying Tai, 2025, ArXiv)
- MFSR-GAN: Multi-Frame Super-Resolution with Handheld Motion Modeling(Fadeel Sher Khan, Joshua Peter Ebenezer, Hamid R. Sheikh, Seok-Jun Lee, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution(Zhikai Chen, Fuchen Long, Zhaofan Qiu, Ting Yao, Wengang Zhou, Jiebo Luo, Tao Mei, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- DAM-VSR: Disentanglement of Appearance and Motion for Video Super-Resolution(Zhe Kong, Le Li, Yong Zhang, Feng Gao, Shaoshu Yang, Tao Wang, Kaihao Zhang, Zhuoliang Kang, Xiaoming Wei, Guanying Chen, Wenhan Luo, 2025, ArXiv Preprint)
- DiTVR: Zero-Shot Diffusion Transformer for Video Restoration(Sicheng Gao, Nancy Mehta, Zongwei Wu, R. Timofte, 2025, ArXiv)
- Improving Temporal Consistency and Fidelity at Inference-time in Perceptual Video Restoration by Zero-shot Image-based Diffusion Models(Nasrin Rahimi, A. Murat Tekalp, 2025, ArXiv)
- Multi-Frame Super-Resolution Algorithm Based on a WGAN(Keqing Ning, Zhihao Zhang, Kai Han, Siyu Han, Xiqing Zhang, 2021, IEEE Access)
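The one-step sampling idea shared by several of the diffusion methods above can be sketched schematically: instead of an iterative denoising loop, a single network evaluation maps a noised latent, conditioned on the low-resolution video, directly to the restored latent. The toy model, noise level, and shapes below are purely illustrative assumptions, not any cited method's design.

```python
# Schematic sketch of the one-step shortcut used by several diffusion-based VSR methods:
# predict the clean latent from a noised latent in a single network evaluation,
# conditioned on the low-resolution video. Everything here is a toy stand-in.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for a video diffusion backbone predicting x0 from (noisy latent, LR condition)."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(2 * channels, channels, 3, padding=1), nn.SiLU(),
            nn.Conv3d(channels, channels, 3, padding=1))

    def forward(self, noisy, lr_cond):
        return self.net(torch.cat([noisy, lr_cond], dim=1))

def one_step_restore(model, lr_latent, noise_level=0.5):
    # Multi-step samplers would loop over decreasing noise levels here; one-step variants
    # start from the LR latent plus a fixed amount of noise and call the network once.
    noisy = lr_latent + noise_level * torch.randn_like(lr_latent)
    return model(noisy, lr_latent)

lr_latent = torch.rand(1, 8, 5, 32, 32)                    # (B, C, T, H, W) latent of a 5-frame clip
print(one_step_restore(TinyDenoiser(), lr_latent).shape)   # torch.Size([1, 8, 5, 32, 32])
```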
Joint Space-Time Super-Resolution (ST-VSR) and Arbitrary-Scale Resampling
Discusses single-stage methods that combine spatial super-resolution with video frame interpolation (VFI), as well as implicit neural representations (INR) for continuous video reconstruction at arbitrary spatial scales and arbitrary time points. A minimal INR decoding sketch follows the reference list below.
- Zooming Slow-Mo: Fast and Accurate One-Stage Space-Time Video Super-Resolution(Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Y. Fu, J. Allebach, Chenliang Xu, 2020, 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Zooming SlowMo: An Efficient One-Stage Framework for Space-Time Video Super-Resolution(Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, Chenliang Xu, 2021, ArXiv Preprint)
- VJT: A Video Transformer on Joint Tasks of Deblurring, Low-light Enhancement and Denoising(Yuxiang Hui, Yang Liu, Yaofang Liu, Fan Jia, Jinshan Pan, Raymond H. Chan, Tieyong Zeng, 2024, ArXiv)
- Continuous Space-Time Video Resampling with Invertible Motion Steganography(Yuantong Zhang, Zhenzhong Chen, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Deep Space-Time Video Upsampling Networks(Jaeyeon Kang, Younghyun Jo, Seoung Wug Oh, Péter Vajda, Seon Joo Kim, 2020, ArXiv)
- SPSTT: Second-Order Propagation Spatial Temporal Transformer Network for Space-Time Video Super-Resolution(Yaping Qi, Rui Su, Lei Chen, Xianye Ben, Zheng Dong, Hongchao Zhou, 2023, Proceedings of the 2023 6th International Conference on Image and Graphics Processing)
- Continuous Space-Time Video Super-Resolution Utilizing Long-Range Temporal Information(Yuantong Zhang, Daiqin Yang, Zhenzhong Chen, Wenpeng Ding, 2023, ArXiv)
- RSTT: Real-time Spatial Temporal Transformer for Space-Time Video Super-Resolution(Z. Geng, Luming Liang, Tianyu Ding, Ilya Zharkov, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events(Shuoyan Wei, Feng Li, Shengeng Tang, Runmin Cong, Yao Zhao, Meng Wang, Huihui Bai, 2025, ArXiv)
- Arbitrary-Scale Video Super-resolution Guided by Dynamic Context(Cong Huang, Jiahao Li, Lei Chu, Dong Liu, Yan Lu, 2024, No journal)
- Implicit Neural Representation for Video Restoration(M. Aiyetigbo, Wanqi Yuan, Feng Luo, Nianyi Li, 2025, ArXiv)
- Continuous Space-Time Video Super-Resolution with Multi-Stage Motion Information Reorganization(Yuantong Zhang, Daiqin Yang, Zhenzhong Chen, Wenpeng Ding, 2024, ACM Transactions on Multimedia Computing, Communications and Applications)
- Optical Flow Reusing for High-Efficiency Space-Time Video Super Resolution(Yuantong Zhang, Huairui Wang, Han Zhu, Zhenzhong Chen, 2021, ArXiv Preprint)
- SAVSR: Arbitrary-Scale Video Super-Resolution via a Learned Scale-Adaptive Network(Zekun Li, Hongying Liu, Fanhua Shang, Yuanyuan Liu, Liang Wan, Wei Feng, 2024, No journal)
- Transforming time and space: efficient video super-resolution with hybrid attention and deformable transformers(Linling Jiang, Xin Wang, Fan Zhang, Caiming Zhang, 2025, The Visual Computer)
- Efficient Space-time Video Super Resolution using Low-Resolution Flow and Mask Upsampling(Saikat Dutta, Nisarg A. Shah, Anurag Mittal, 2021, ArXiv Preprint)
- Learning Spatio-Temporal Downsampling for Effective Video Upscaling(Xiaoyu Xiang, Yapeng Tian, Vijay Rengarajan, Lucas D. Young, Bo Zhu, Rakesh Ranjan, 2022, No journal)
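A minimal sketch of the implicit-neural-representation decoding used for arbitrary-scale reconstruction: encoder features are sampled at a query location and an MLP maps (feature, x, y, t) to an RGB value, so any spatial resolution or timestamp can be queried. Layer sizes and the single-frame setup are illustrative assumptions.

```python
# Minimal sketch of implicit-neural-representation decoding for arbitrary-scale VSR:
# an MLP maps (sampled encoder feature, spatial coordinate, time) to an RGB value,
# so any output resolution or timestamp can be queried. Names/sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImplicitVideoDecoder(nn.Module):
    def __init__(self, feat_channels=32, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_channels + 3, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, 3))

    def forward(self, feat, coords):
        # feat:   (B, C, H, W)  feature map from an LR encoder for one reference frame
        # coords: (B, N, 3)     query points (x, y, t), each in [-1, 1]
        grid = coords[:, :, None, :2]                          # (B, N, 1, 2) for grid_sample
        sampled = F.grid_sample(feat, grid, mode='bilinear',
                                align_corners=False)            # (B, C, N, 1)
        sampled = sampled[..., 0].permute(0, 2, 1)              # (B, N, C)
        return self.mlp(torch.cat([sampled, coords], dim=-1))   # (B, N, 3) RGB

# Query a 4x-denser spatial grid at an intermediate timestamp t = 0.25.
feat = torch.rand(1, 32, 16, 16)
ys, xs = torch.meshgrid(torch.linspace(-1, 1, 64), torch.linspace(-1, 1, 64), indexing='ij')
coords = torch.stack([xs, ys, torch.full_like(xs, 0.25)], dim=-1).reshape(1, -1, 3)
print(ImplicitVideoDecoder()(feat, coords).shape)               # torch.Size([1, 4096, 3])
```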
Blind Video Super-Resolution and Real-World Complex Degradations
Addresses the unknown blur kernels, noise, and compression artifacts of real-world footage, improving generalization to in-the-wild videos through self-supervised learning, degradation modeling, contrastive learning, or meta-learning strategies. A minimal synthetic-degradation sketch follows the reference list below.
- Self-Supervised Deep Blind Video Super-Resolution(Haoran Bai, Jin-shan Pan, 2022, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- Blind Video Super-Resolution based on Implicit Kernels(Qiang Zhu, Yuxuan Jiang, Shuyuan Zhu, Fan Zhang, David R. Bull, Bing Zeng, 2025, ArXiv)
- Deep Blind Video Super-resolution(Jin-shan Pan, Song-Yang Cheng, Jiawei Zhang, Jinhui Tang, 2020, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))
- Kernel adaptive memory network for blind video super-resolution(June Yun, Min Hyuk Kim, Hyung-Il Kim, S. Yoo, 2023, Expert Syst. Appl.)
- Temporal Kernel Consistency for Blind Video Super-Resolution(Li Xiang, Royson Lee, Mohamed S. Abdelfattah, N. Lane, Hongkai Wen, 2021, 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW))
- DynaVSR: Dynamic Adaptive Blind Video Super-Resolution(Suyoung Lee, Myungsub Choi, Kyoung Mu Lee, 2020, 2021 IEEE Winter Conference on Applications of Computer Vision (WACV))
- Ada-VSR: Adaptive Video Super-Resolution with Meta-Learning(Akash Gupta, P. Jonnalagedda, B. Bhanu, A. Roy-Chowdhury, 2021, Proceedings of the 29th ACM International Conference on Multimedia)
- RealPixVSR: Pixel-Level Visual Representation Informed Super-Resolution of Real-World Videos(Tony Nokap Park, Yunho Jeon, Taeyoung Na, 2024, 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW))
- NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-World Video Super-Resolution(Yexing Song, Meilin Wang, Zhijing Yang, Xiaoyu Xian, Yukai Shi, 2023, ArXiv Preprint)
- Expanding Synthetic Real-World Degradations for Blind Video Super Resolution(Mehran Jeelani, Sadbhawna, N. Cheema, K. Illgner-Fehns, Philipp Slusallek, S. Jaiswal, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Contrastive Learning for Controllable Blind Video Restoration(Givi Meishvili, Abdelaziz Djelouah, S. Hattori, Christopher Schroers, 2022, No journal)
- Investigating Tradeoffs in Real-World Video Super-Resolution(Kelvin C. K. Chan, Shangchen Zhou, Xiangyu Xu, Chen Change Loy, 2021, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- RealViformer: Investigating Attention for Real-World Video Super-Resolution(Yuehan Zhang, Angela Yao, 2024, ArXiv)
- Deep Video Prior for Video Consistency and Propagation(Chenyang Lei, Yazhou Xing, Hao Ouyang, Qifeng Chen, 2022, ArXiv Preprint)
- Blind Video Temporal Consistency via Deep Video Prior(Chenyang Lei, Yazhou Xing, Qifeng Chen, 2020, ArXiv Preprint)
- Blind Super Resolution of Real-Life Video Sequences(Esmaeil Faramarzi, D. Rajan, Felix C. A. Fernandes, M. Christensen, 2016, IEEE Transactions on Image Processing)
- AIM 2025 Challenge on Robust Offline Video Super-Resolution: Dataset, Methods and Results(Nikolai Karetin, Ivan Molodetskikh, Dmitry Vatolin, R. Timofte, Yixin Yang, Junyang Chen, Jiangxin Dong, Jinshan Pan, Zhihao Liu, Lishen Qu, Shihao Zhou, Jufeng Yang, Yuxuan Jiang, Siyue Teng, Chengxi Zeng, Fan Zhang, David R. Bull, Qi Tang, Jie Liu, Jie Tang, Gangshan Wu, 2025, 2025 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW))
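Training blind and real-world VSR models typically relies on synthetic degradation pipelines. Below is a minimal sketch of such a pipeline (random Gaussian blur, bicubic downsampling, additive noise); the pipelines in the cited works additionally cover compression, varied kernels, and temporal degradations, and all parameter ranges here are illustrative.

```python
# Minimal sketch of a synthetic degradation pipeline for blind / real-world VSR training:
# random Gaussian blur -> bicubic downsampling -> additive noise. Illustrative only.
import torch
import torch.nn.functional as F

def gaussian_kernel(ksize=7, sigma=1.5):
    ax = torch.arange(ksize, dtype=torch.float32) - (ksize - 1) / 2
    g = torch.exp(-(ax ** 2) / (2 * sigma ** 2))
    k = torch.outer(g, g)
    return k / k.sum()

def degrade(hr_frames, scale=4, sigma=None, noise_std=None):
    # hr_frames: (T, 3, H, W) in [0, 1]
    sigma = sigma if sigma is not None else float(torch.empty(1).uniform_(0.2, 3.0))
    noise_std = noise_std if noise_std is not None else float(torch.empty(1).uniform_(0.0, 0.1))
    k = gaussian_kernel(sigma=sigma).to(hr_frames)
    k = k.view(1, 1, *k.shape).repeat(3, 1, 1, 1)                 # depthwise blur kernel
    blurred = F.conv2d(hr_frames, k, padding=k.shape[-1] // 2, groups=3)
    lr = F.interpolate(blurred, scale_factor=1 / scale, mode='bicubic', align_corners=False)
    lr = lr + noise_std * torch.randn_like(lr)                    # sensor-like noise
    return lr.clamp(0, 1)

hr = torch.rand(5, 3, 128, 128)
print(degrade(hr).shape)                                          # torch.Size([5, 3, 32, 32])
```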
Lightweight Design and Real-Time Enhancement on Edge Devices
Focuses on practical deployment on mobile and other resource-constrained devices, optimizing inference speed through reparameterization, pruning, sub-pixel convolution, and efficient sliding-window designs. A minimal reparameterization sketch follows the reference list below.
- RVSRT: real-time video super resolution transformer(Linlin Ou, Yuanping Chen, 2023, No journal)
- RepNet-VSR: Reparameterizable Architecture for High-Fidelity Video Super-Resolution(Biao Wu, Diankai Zhang, Shaoli Liu, Si Gao, Chengjian Zheng, Ning Wang, 2025, ArXiv Preprint)
- Sliding Window Recurrent Network for Efficient Video Super-Resolution(Wenyi Lian, Wenjing Lian, 2022, ArXiv Preprint)
- Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network(Wenzhe Shi, Jose Caballero, Ferenc Huszár, J. Totz, Andrew P. Aitken, Rob Bishop, D. Rueckert, Zehan Wang, 2016, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR))
- Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation(Jose Caballero, C. Ledig, Andrew P. Aitken, Alejandro Acosta, J. Totz, Zehan Wang, Wenzhe Shi, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR))
- Real-Time Super-Resolution System of 4K-Video Based on Deep Learning(Yanpeng Cao, Chengcheng Wang, Changjun Song, Yongming Tang, He Li, 2021, 2021 IEEE 32nd International Conference on Application-specific Systems, Architectures and Processors (ASAP))
- FDDCC-VSR: a lightweight video super-resolution network based on deformable 3D convolution and cheap convolution(Xiaohua Wang, Xingdong Yang, Hengrui Li, Tao Li, 2024, The Visual Computer)
- Fast deblurring in video super resolution(Xueqing Yang, Tingting Fan, 2016, 2016 IEEE 13th International Conference on Signal Processing (ICSP))
- Fast image/video upsampling(Qi Shan, Zhaorong Li, Jiaya Jia, Chi-Keung Tang, 2008, ACM SIGGRAPH Asia 2008 papers)
- Low-Power Content-Based Video Acquisition for Super-Resolution Enhancement(Serene Banerjee, 2009, IEEE Transactions on Multimedia)
- Low-Resource Video Super-Resolution using Memory, Wavelets, and Deformable Convolutions(Kavitha Viswanathan, Shashwat Pathak, Piyush Bharambe, Harsh Choudhary, Amit Sethi, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
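Structural reparameterization, as used by RepNet-style efficient models, can be shown in a few lines: a training-time block with parallel 3x3, 1x1, and identity branches is folded into a single 3x3 convolution with an identical mapping for deployment. The block below is a generic sketch, not the cited architecture.

```python
# Minimal sketch of structural reparameterization: fold parallel 3x3 + 1x1 + identity
# branches into one 3x3 convolution that computes the same function. Illustrative only.
import torch
import torch.nn as nn

class RepBlock(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):                       # training-time multi-branch form
        return self.conv3(x) + self.conv1(x) + x

    @torch.no_grad()
    def fuse(self):
        """Fold the 3x3, 1x1, and identity branches into a single 3x3 conv."""
        c = self.conv3.out_channels
        w = self.conv3.weight.clone()
        w[:, :, 1, 1] += self.conv1.weight[:, :, 0, 0]             # 1x1 branch -> center tap
        w[torch.arange(c), torch.arange(c), 1, 1] += 1.0            # identity branch -> center tap
        fused = nn.Conv2d(c, c, 3, padding=1)
        fused.weight.copy_(w)
        fused.bias.copy_(self.conv3.bias + self.conv1.bias)
        return fused

block = RepBlock().eval()
fused = block.fuse()
x = torch.rand(1, 16, 32, 32)
print(torch.allclose(block(x), fused(x), atol=1e-5))                # True: identical mapping
```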
Domain-Driven Applications: Remote Sensing, Satellite, Face, and Medical Imaging
Customized enhancement for the particular challenges of vertical domains (e.g., tiny targets in satellite video, identity preservation for faces, the precision requirements of medical endoscopy), combining auxiliary information or prior knowledge.
- Multi-Frame Super-Resolution of Gaofen-4 Remote Sensing Images(Jieping Xu, Yonghui Liang, Jin Liu, Zongfu Huang, 2017, Sensors (Basel, Switzerland))
- Deep Blind Super-Resolution for Satellite Video(Yi Xiao, Qiangqiang Yuan, Qiang Zhang, L. Zhang, 2024, IEEE Transactions on Geoscience and Remote Sensing)
- VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution(Liangbin Xie, Xintao Wang, Honglun Zhang, Chao Dong, Ying Shan, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- MedVSR: Medical Video Super-Resolution with Cross State-Space Propagation(Xinyu Liu, Guolei Sun, Cheng Wang, Yixuan Yuan, E. Konukoglu, 2025, ArXiv)
- E-SEVSR—Edge Guided Stereo Endoscopic Video Super-Resolution(Mansoor Hayat, S. Aramvith, 2024, IEEE Access)
- Blind Restoration of High-Resolution Ultrasound Video(Chu Chen, Kangning Cui, Pasquale Cascarano, Wei Tang, E. L. Piccolomini, Raymond H. Chan, 2025, No journal)
- Omnidirectional Video Super-Resolution Using Deep Learning(Arbind Agrahari Baniya, Tsz-Kwan Lee, Peter W. Eklund, Sunil Aryal, 2025, IEEE Transactions on Multimedia)
- Enhancing Lunar Reconnaissance Orbiter Images via Multi-Frame Super Resolution for Future Robotic Space Missions(J. I. Delgado-Centeno, P. Sanchez-Cuevas, Carol Martínez, M. Olivares-Méndez, 2021, IEEE Robotics and Automation Letters)
- Multi-frame Super Resolution for Ocular Biometrics(N. Reddy, Dewan Fahim Noor, Zhu Li, R. Derakhshani, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Resolution enhancement of ROI from surveillance video using Bernstein interpolation(Minjae Kim, Hanseok Ko, 2011, 2011 8th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS))
Multi-Modal Guidance and Multi-Task Joint Restoration
Comprehensive approaches that exploit auxiliary sensors such as event cameras and depth maps, or that jointly handle multiple degradations including noise, blur, and compression artifacts. A minimal event-rasterization sketch follows the reference list below.
- EvSTVSR: Event Guided Space-Time Video Super-Resolution(Haojie Yan, Zhan Lu, Zehao Chen, De Ma, Huajin Tang, Qian Zheng, Gang Pan, 2025, No journal)
- Event-Enhanced Blurry Video Super-Resolution(Dachun Kai, Yueyi Zhang, Jin Wang, Zeyu Xiao, Zhiwei Xiong, Xiaoyan Sun, 2025, No journal)
- DAVIDE: Depth-Aware Video Deblurring(German F. Torres, Jussi Kalliola, Soumya Tripathy, Erman Acar, Joni-Kristian Kämäräinen, 2024, ArXiv Preprint)
- FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring(Geunhyuk Youk, Jihyong Oh, Munchurl Kim, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- FFSTIE: Video Restoration With Full-Frequency Spatio-Temporal Information Enhancement(Liqun Lin, Jianhui Wang, Guangpeng Wei, Mingxing Wang, Ang Zhang, 2025, IEEE Signal Processing Letters)
- Concurrent Video Denoising and Deblurring for Dynamic Scenes(E. Katsaros, Piotr Kopa Ostrowski, Daniel Wȩsierski, Anna Jezierska, 2021, IEEE Access)
- Video Denoising and Enhancement via Dynamic Video Layering(Han Guo, Namrata Vaswani, 2017, ArXiv Preprint)
- Reversing the Damage: A QP-Aware Transformer-Diffusion Approach for 8K Video Restoration under Codec Compression(Ali Mollaahmadi Dehaghi, Reza Razavi, Mohammad Moshirpour, 2024, 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))
- Restoring Real-World Degraded Events Improves Deblurring Quality(Yeqing Shen, Shang Li, Kun Song, 2024, ArXiv Preprint)
- EvEnhancer: Empowering Effectiveness, Efficiency and Generalizability for Continuous Space-Time Video Super-Resolution with Events(Shuoyan Wei, Feng Li, Shengeng Tang, Yao Zhao, Huihui Bai, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
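Before events can guide frame restoration they are usually rasterized into a tensor. The sketch below accumulates events (x, y, t, polarity) into a voxel grid with a few temporal bins; the bin count, resolution, and random test data are illustrative assumptions.

```python
# Minimal sketch of event rasterization: accumulate (x, y, t, polarity) events into a
# voxel grid with a few temporal bins before fusing them with frames. Illustrative only.
import torch

def events_to_voxel_grid(events, num_bins=5, height=64, width=64):
    # events: (N, 4) tensor with columns (x, y, t, polarity in {-1, +1})
    x, y, t, p = events[:, 0].long(), events[:, 1].long(), events[:, 2], events[:, 3]
    t_norm = (t - t.min()) / (t.max() - t.min() + 1e-9)            # normalize time to [0, 1]
    bin_idx = (t_norm * (num_bins - 1)).round().long()
    voxels = torch.zeros(num_bins, height, width)
    flat_idx = bin_idx * height * width + y * width + x
    voxels.view(-1).index_add_(0, flat_idx, p)                      # signed accumulation
    return voxels                                                    # (num_bins, H, W)

# 1000 random events on a 64x64 sensor over a 10 ms window.
n = 1000
events = torch.stack([
    torch.randint(0, 64, (n,)).float(),                             # x
    torch.randint(0, 64, (n,)).float(),                             # y
    torch.rand(n) * 0.01,                                            # t (seconds)
    torch.randint(0, 2, (n,)).float() * 2 - 1,                       # polarity
], dim=1)
print(events_to_voxel_grid(events).shape)                            # torch.Size([5, 64, 64])
```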
Raw-Domain / Burst Super-Resolution and Multi-Frame Fusion
Focuses on processing camera raw data, exploiting sub-pixel shifts between burst frames to recover detail, with emphasis on noise modeling and joint raw-to-RGB conversion. A minimal shift-and-add fusion sketch follows the reference list below.
- NTIRE 2021 Challenge on Burst Super-Resolution: Methods and Results(Goutam Bhat, Martin Danelljan, Radu Timofte, Kazutoshi Akita, Wooyeong Cho, Haoqiang Fan, Lanpeng Jia, Daeshik Kim, Bruno Lecouat, Youwei Li, Shuaicheng Liu, Ziluan Liu, Ziwei Luo, Takahiro Maeda, Julien Mairal, Christian Micheloni, Xuan Mo, Takeru Oba, Pavel Ostyakov, Jean Ponce, Sanghyeok Son, Jian Sun, Norimichi Ukita, Rao Muhammad Umer, Youliang Yan, Lei Yu, Magauiya Zhussip, Xueyi Zou, 2021, ArXiv Preprint)
- Multi-Frame Super-Resolution With Raw Images Via Modified Deformable Convolution(Gongzhe Li, Linwei Qiu, Haopeng Zhang, Feng-ying Xie, Zhi-guo Jiang, 2022, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Multi-Stage Raw Video Denoising with Adversarial Loss and Gradient Mask(Avinash Paliwal, Libing Zeng, Nima Khademi Kalantari, 2021, ArXiv Preprint)
- Deep Reparametrization of Multi-Frame Super-Resolution and Denoising(Goutam Bhat, Martin Danelljan, F. Yu, L. Gool, R. Timofte, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))
- GenMFSR: Generative Multi-Frame Image Restoration and Super-Resolution(Harshana Weligampola, Joshua Peter Ebenezer, Weidi Liu, Abhinau K. Venkataramanan, Sreenithy Chandran, Seok-Jun Lee, Hamid Rahim Sheikh, 2026, ArXiv Preprint)
- STIFS: Spatio-Temporal Input Frame Selection for Learning-based Video Super-Resolution Models(Arbind Agrahari Baniya, Tsz-Kwan Lee, P. Eklund, Sunil Aryal, 2022, No journal)
- Learnable Global Spatio-Temporal Adaptive Aggregation for Bracketing Image Restoration and Enhancement(Xinwei Dai, Yuanbo Zhou, Xintao Qiu, Hui Tang, Wei Deng, Qingquan Gao, Tong Tong, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
- Robust Multi-Frame Super-Resolution Based on Adaptive Half-Quadratic Function and Local Structure Tensor Weighted BTV(Shanshan Liu, Minghui Wang, Qingbin Huang, Xia Liu, 2021, Sensors (Basel, Switzerland))
- Multi-Frame Super-Resolution Reconstruction Based on Gradient Vector Flow Hybrid Field(Shuying Huang, Jun Sun, Yong Yang, Yuming Fang, P. Lin, 2017, IEEE Access)
- A Single-Frame and Multi-Frame Cascaded Image Super-Resolution Method(Jing Sun, Qiangqiang Yuan, Huanfeng Shen, Jie Li, Liangpei Zhang, 2024, ArXiv Preprint)
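The core multi-frame idea can be illustrated with classical shift-and-add fusion: each burst frame is placed on a finer grid according to its sub-pixel shift and the samples are averaged. Real burst pipelines estimate the shifts and operate on raw mosaics; the known-offset placement below is a simplification for illustration.

```python
# Minimal sketch of classical shift-and-add multi-frame fusion with known sub-pixel shifts.
# Illustrative only: grayscale frames, nearest HR-grid placement, bilinear hole filling.
import torch
import torch.nn.functional as F

def shift_and_add(frames, shifts, scale=2):
    # frames: (N, 1, H, W) burst of grayscale frames
    # shifts: (N, 2) sub-pixel shifts (dy, dx) of each frame, in LR pixels
    n, _, h, w = frames.shape
    acc = torch.zeros(1, h * scale, w * scale)
    weight = torch.zeros_like(acc)
    for f, (dy, dx) in zip(frames, shifts):
        # Nearest HR grid position of this frame's samples.
        oy = int(round(float(dy) * scale)) % scale
        ox = int(round(float(dx) * scale)) % scale
        acc[:, oy::scale, ox::scale] += f
        weight[:, oy::scale, ox::scale] += 1.0
    fused = acc / weight.clamp(min=1.0)
    # Fill HR positions that received no sample with a bilinear upsampling of the mean frame.
    base = F.interpolate(frames.mean(dim=0, keepdim=True), scale_factor=scale,
                         mode='bilinear', align_corners=False)[0]
    return torch.where(weight > 0, fused, base)

frames = torch.rand(4, 1, 32, 32)
shifts = torch.tensor([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [0.5, 0.5]])
print(shift_and_add(frames, shifts).shape)                           # torch.Size([1, 64, 64])
```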
Traditional Mathematical Models, Optimization Methods, and Coding-Oriented Enhancement
Includes methods built on classical mathematical tools such as regularization, sparse representation, and variational formulations, as well as pre- and post-processing techniques coupled with video coding standards (H.265/VVC). A minimal variational-restoration sketch follows the reference list below.
- Novel image restoration method based on multi-frame super-resolution for atmospherically distorted images(Yinhao Li, Katsuhisa Ogawa, Yutaro Iwamoto, Yenwei Chen, 2020, IET Image Process.)
- Video Restoration with a Deep Plug-and-Play Prior(Antoine Monod, J. Delon, Matias Tassano, Andrés Almansa, 2022, ArXiv)
- An Augmented Lagrangian Method for Total Variation Video Restoration(Stanley H. Chan, Ramsin Khoshabeh, Kristofor B. Gibson, P. Gill, Truong Q. Nguyen, 2011, IEEE Transactions on Image Processing)
- Double Sparse Multi-Frame Image Super Resolution(Toshiyuki Kato, Hideitsu Hino, Noboru Murata, 2015, ArXiv Preprint)
- Multi-Frame Super-Resolution Combining Demons Registration and Regularized Bayesian Reconstruction(Thaís Pedruzzi do Nascimento, E. Salles, 2020, IEEE Signal Processing Letters)
- Compressed Video Super-Resolution based on Hierarchical Encoding(Yuxuan Jiang, Siyue Teng, Qiang Zhu, Chen Feng, Chengxi Zeng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David R. Bull, 2025, ArXiv)
- Video Compression Based on Spatio-Temporal Resolution Adaptation(Mariana Afonso, Fan Zhang, D. Bull, 2019, IEEE Transactions on Circuits and Systems for Video Technology)
- Learning-Based Scalable Video Coding with Spatial and Temporal Prediction(Martin Benjak, Yi-Hsin Chen, Wen-Hsiao Peng, Jörn Ostermann, 2023, 2023 IEEE International Conference on Visual Communications and Image Processing (VCIP))
- FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution(Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour, 2025, ArXiv)
- Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution(Zhongwei Qiu, Huan Yang, Jianlong Fu, Dongmei Fu, 2022, ArXiv)
- DCTResNet: Transform Domain Image Deblocking for Motion Blur Images(Paras Maharjan, N. Xu, Xuan Xu, Yuyan Song, Zhu Li, 2021, 2021 International Conference on Visual Communications and Image Processing (VCIP))
- Temporal Down-sampling based Video Coding with Frame-Recurrent Enhancement(Keren He, Chen Fu, Chi Do-Kim Pham, Lu Zhang, Jinjia Zhou, 2023, 2023 Data Compression Conference (DCC))
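The variational formulation underlying several of these classical methods can be sketched as gradient descent on a data-fidelity term plus total-variation regularization, min_x 0.5*||x − y||² + λ·TV(x). The denoising-only forward model, smoothed TV, step size, and iteration count below are illustrative simplifications, not any cited algorithm.

```python
# Minimal sketch of TV-regularized restoration by gradient descent:
# minimize 0.5*||x - y||^2 + lam * TV(x) with a smoothed (differentiable) TV term.
import torch

def tv_restore(y, lam=0.1, step=0.2, iters=100):
    x = y.clone()
    for _ in range(iters):
        x = x.detach().requires_grad_(True)
        dh = x[:, :, 1:] - x[:, :, :-1]                   # horizontal differences
        dv = x[:, 1:, :] - x[:, :-1, :]                   # vertical differences
        tv = torch.sqrt(dh ** 2 + 1e-6).sum() + torch.sqrt(dv ** 2 + 1e-6).sum()
        loss = 0.5 * ((x - y) ** 2).sum() + lam * tv
        loss.backward()
        with torch.no_grad():
            x = x - step * x.grad
    return x.detach()

noisy = torch.rand(3, 32, 32)                              # toy "noisy" frame
print(tv_restore(noisy).shape)                             # torch.Size([3, 32, 32])
```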
This report organizes research on video enhancement and super-resolution into ten dimensions. The technical roadmap has evolved from classical CNN propagation mechanisms to Transformer/Mamba long-range modeling and on to generative reconstruction with diffusion models; application scenarios have expanded from general-purpose enhancement to vertical domains such as satellite remote sensing, face restoration, and raw-domain multi-frame fusion; on the engineering side, the field balances joint space-time super-resolution, generalization to blind degradations, lightweight deployment, and tight integration with video coding standards. Overall, the trend is a shift from pure pixel reconstruction toward perceptual enhancement, from fixed scale factors toward arbitrary scales, and from laboratory settings toward real-world complex degradations.
A total of 221 related references. Selected paper abstracts follow.
Recently, several models based on deep neural networks have achieved great success in terms of both reconstruction accuracy and computational performance for single image super-resolution. In these methods, the low resolution (LR) input image is upscaled to the high resolution (HR) space using a single filter, commonly bicubic interpolation, before reconstruction. This means that the super-resolution (SR) operation is performed in HR space. We demonstrate that this is sub-optimal and adds computational complexity. In this paper, we present the first convolutional neural network (CNN) capable of real-time SR of 1080p videos on a single K2 GPU. To achieve this, we propose a novel CNN architecture where the feature maps are extracted in the LR space. In addition, we introduce an efficient sub-pixel convolution layer which learns an array of upscaling filters to upscale the final LR feature maps into the HR output. By doing so, we effectively replace the handcrafted bicubic filter in the SR pipeline with more complex upscaling filters specifically trained for each feature map, whilst also reducing the computational complexity of the overall SR operation. We evaluate the proposed approach using images and videos from publicly available datasets and show that it performs significantly better (+0.15dB on Images and +0.39dB on Videos) and is an order of magnitude faster than previous CNN-based methods.
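A minimal sketch of the sub-pixel convolution described above: convolutions operate entirely in LR space, the last layer produces r² feature maps per output channel, and PixelShuffle rearranges them into the HR image. Layer sizes loosely follow the description but are illustrative assumptions.

```python
# Minimal sketch of sub-pixel convolution upscaling: all convolutions run in LR space
# and PixelShuffle rearranges r^2 feature maps per channel into the HR output.
import torch
import torch.nn as nn

class SubPixelUpsampler(nn.Module):
    def __init__(self, scale=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, 5, padding=2), nn.Tanh(),
            nn.Conv2d(64, 32, 3, padding=1), nn.Tanh(),
            nn.Conv2d(32, 3 * scale * scale, 3, padding=1),   # r^2 maps per output channel
            nn.PixelShuffle(scale))                            # (B, 3*r^2, H, W) -> (B, 3, rH, rW)

    def forward(self, lr):
        return self.body(lr)

lr = torch.rand(1, 3, 60, 60)
print(SubPixelUpsampler()(lr).shape)                           # torch.Size([1, 3, 180, 180])
```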
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. However, these models struggle to maintain temporal consistency, as they are trained on static images, limiting their ability to capture temporal dynamics effectively. Integrating text-to-video (T2V) models into video super-resolution for improved temporal modeling is straightforward. However, two key challenges remain: artifacts introduced by complex degradations in real-world scenarios, and compromised fidelity due to the strong generative capacity of powerful T2V models (e.g., CogVideoX-5B). To enhance the spatio-temporal quality of restored videos, we introduce STAR (Spatial-Temporal Augmentation with T2V models for Real-world video super-resolution), a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency. Specifically, we introduce a Local Information Enhancement Module (LIEM) before the global attention block to enrich local details and mitigate degradation artifacts. Moreover, we propose a Dynamic Frequency (DF) Loss to reinforce fidelity, guiding the model to focus on different frequency components across diffusion steps. Extensive experiments demonstrate STAR outperforms state-of-the-art methods on both synthetic and real-world datasets.
A recurrent structure is a popular framework choice for the task of video super-resolution. The state-of-the-art method BasicVSR adopts bidirectional propagation with feature alignment to effectively exploit information from the entire input video. In this study, we redesign BasicVSR by proposing second-order grid propagation and flow-guided deformable alignment. We show that by empowering the recurrent framework with enhanced propagation and alignment, one can exploit spatiotemporal information across misaligned video frames more effectively. The new components lead to an improved performance under a similar computational constraint. In particular, our model BasicVSR++ surpasses BasicVSR by a significant 0.82 dB in PSNR with similar number of parameters. BasicVSR++ is generalizable to other video restoration tasks, and obtains three champions and one first runner-up in NTIRE 2021 video restoration challenge.
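The flow-guided deformable alignment described above can be sketched roughly as follows: neighbor features are pre-warped with optical flow, a small head predicts residual offsets plus a modulation mask, and torchvision's deform_conv2d applies them. The offset head, channel counts, and the absence of second-order grid propagation are simplifications for illustration, not the authors' implementation.

```python
# Simplified sketch of flow-guided deformable alignment: pre-warp neighbor features with
# optical flow, then predict residual offsets and a modulation mask for deformable conv.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

def flow_warp(feat, flow):
    # feat: (B, C, H, W), flow: (B, 2, H, W) with (dx, dy) in pixels
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing='ij')
    grid = torch.stack([xs, ys], dim=0).float().to(feat)            # (2, H, W)
    coords = grid[None] + flow                                        # target sampling positions
    coords_x = 2 * coords[:, 0] / max(w - 1, 1) - 1                   # normalize to [-1, 1]
    coords_y = 2 * coords[:, 1] / max(h - 1, 1) - 1
    grid_n = torch.stack([coords_x, coords_y], dim=-1)                # (B, H, W, 2)
    return F.grid_sample(feat, grid_n, mode='bilinear', padding_mode='border',
                         align_corners=True)

class FlowGuidedAlign(nn.Module):
    def __init__(self, channels=16, ksize=3):
        super().__init__()
        self.ksize = ksize
        self.offset_head = nn.Conv2d(2 * channels + 2, 3 * ksize * ksize, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(channels, channels, ksize, ksize) * 0.1)

    def forward(self, cur_feat, nbr_feat, flow):
        warped = flow_warp(nbr_feat, flow)
        pred = self.offset_head(torch.cat([cur_feat, warped, flow], dim=1))
        o1, o2, mask = torch.chunk(pred, 3, dim=1)
        # Residual offsets on top of the optical flow (deform_conv2d expects (dy, dx) pairs).
        offset = torch.cat([o1, o2], dim=1) + \
            flow.flip(1).repeat(1, self.ksize * self.ksize, 1, 1)
        return deform_conv2d(nbr_feat, offset, self.weight, padding=1,
                             mask=torch.sigmoid(mask))

cur, nbr = torch.rand(1, 16, 32, 32), torch.rand(1, 16, 32, 32)
flow = torch.rand(1, 2, 32, 32)
print(FlowGuidedAlign()(cur, nbr, flow).shape)                        # torch.Size([1, 16, 32, 32])
```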
Video super-resolution (VSR) approaches tend to have more components than the image counterparts as they need to exploit the additional temporal dimension. Complex designs are not uncommon. In this study, we wish to untangle the knots and reconsider some most essential components for VSR guided by four basic functionalities, i.e., Propagation, Alignment, Aggregation, and Upsampling. By reusing some existing components added with minimal redesigns, we show a succinct pipeline, BasicVSR, that achieves appealing improvements in terms of speed and restoration quality in comparison to many state-of-the-art algorithms. We conduct systematic analysis to explain how such gain can be obtained and discuss the pitfalls. We further show the extensibility of BasicVSR by presenting an information-refill mechanism and a coupled propagation scheme to facilitate information aggregation. The BasicVSR and its extension, IconVSR, can serve as strong baselines for future VSR approaches.
How to aggregate spatial-temporal information plays an essential role in video super-resolution (VSR) tasks. Despite the remarkable success, existing methods adopt static convolution to encode spatial-temporal information, which lacks flexibility in aggregating information in large-scale remote sensing scenes, as they often contain heterogeneous features (e.g., diverse textures). In this paper, we propose a spatial feature diversity enhancement module (SDE) and channel diversity enhancement module (CDE), which explore the diverse representation of different local patterns while aggregating the global response with compactly channel-wise embedding representation. Specifically, SDE introduces multiple learnable filters to extract representative spatial variants and encodes them to generate a dynamic kernel for enriched spatial representation. To explore the diversity in the channel dimension, CDE exploits the discrete cosine transform to transform the feature into the frequency domain. This enriches the channel representation while mitigating massive frequency loss caused by pooling operation. Based on SDE and CDE, we further devise a multi-axis feature diversity enhancement (MADE) module to harmonize the spatial, channel, and pixel-wise features for diverse feature fusion. These elaborate strategies form a novel network for satellite VSR, termed MADNet, which achieves favorable performance against state-of-the-art method BasicVSR++ in terms of average PSNR by 0.14 dB on various video satellites, including JiLin-1, Carbonite-2, SkySat-1, and UrtheCast. Code will be available at https://github.com/XY-boy/MADNet
Diffusion models have demonstrated promising performance in real-world video super-resolution (VSR). However, the dozens of sampling steps they require, make inference extremely slow. Sampling acceleration techniques, particularly single-step, provide a potential solution. Nonetheless, achieving one step in VSR remains challenging, due to the high training overhead on video data and stringent fidelity demands. To tackle the above issues, we propose DOVE, an efficient one-step diffusion model for real-world VSR. DOVE is obtained by fine-tuning a pretrained video diffusion model (i.e., CogVideoX). To effectively train DOVE, we introduce the latent-pixel training strategy. The strategy employs a two-stage scheme to gradually adapt the model to the video super-resolution task. Meanwhile, we design a video processing pipeline to construct a high-quality dataset tailored for VSR, termed HQ-VSR. Fine-tuning on this dataset further enhances the restoration capability of DOVE. Extensive experiments show that DOVE exhibits comparable or superior performance to multi-step diffusion-based VSR methods. It also offers outstanding inference efficiency, achieving up to a 28× speed-up over existing methods such as MGLD-VSR. Code is available at: https://github.com/zhengchen1999/DOVE.
Omnidirectional Videos (or 360° videos) are widely used in Virtual Reality (VR) to facilitate immersive and interactive viewing experiences. However, the limited spatial resolution in 360° videos does not allow for each degree of view to be represented with adequate pixels, limiting the visual quality offered in the immersive experience. Deep learning Video Super-Resolution (VSR) techniques used for conventional videos could provide a promising software-based solution; however, these techniques do not tackle the distortion present in equirectangular projections of 360° video signals. An additional obstacle is the limited 360° video datasets to study. To address these issues, this paper creates a novel 360° Video Dataset (360VDS) with a study of the extensibility of conventional VSR models to 360° videos. This paper further proposes a novel deep learning model for 360° Video Super-Resolution (360° VSR), called Spherical Signal Super-resolution with a Proportioned Optimisation (S3PO). S3PO adopts recurrent modelling with an attention mechanism, unbound from conventional VSR techniques like alignment. With a purpose-built feature extractor and a novel loss-function addressing spherical distortion, S3PO outperforms most state-of-the-art conventional VSR models and 360° specific super-resolution models on 360° video datasets. A step-wise ablation study is presented to understand and demonstrate the impact of the chosen architectural sub-components, targeted training and optimisation.
Video super-resolution aims to generate high-resolution video sequences from corresponding low-resolution video sequences. To address the insufficient use of temporal and spatial information in current video super-resolution methods, we propose a new network based on deformable 3D convolutional group fusion. Input sequences are divided into groups according to different frame rates, which integrates temporal information effectively in a hierarchical manner. Deformable 3D convolution is used to fuse features within each group, preserving the spatial and temporal correlation of the video sequences. A temporal attention mechanism and a group integration module provide complementary information fusion across groups, restoring missing details in the video sequence and generating high-resolution video frames. Experimental results on the Vid4 standard video dataset show that the PSNR and SSIM of the generated high-resolution video frames are 27.39 and 0.8266, respectively. The proposed network handles motion video well and achieves better performance than current advanced methods.
Exploiting temporal correlations is crucial for video super-resolution (VSR). Recent approaches enhance this by incorporating event cameras. In this paper, we introduce MamEVSR, a Mamba-based network for event-based VSR that leverages the selective state space model, Mamba. MamEVSR stands out by offering global receptive field coverage with linear computational complexity, thus addressing the limitations of convolutional neural networks and Transformers. The key components of MamEVSR include: (1) The interleaved Mamba (iMamba) block, which interleaves tokens from adjacent frames and applies multidirectional selective state space modeling, enabling efficient feature fusion and propagation across bi-directional frames while maintaining linear complexity. (2) The cross-modality Mamba (cMamba) block facilitates further interaction and aggregation between event information and the output from the iMamba block. The cMamba block can leverage complementary spatio-temporal information from both modalities and allows MamEVSR to capture finer motion details. Experimental results show that the proposed MamEVSR achieves superior performance on various datasets quantitatively and qualitatively.
Video super-resolution (VSR) aims to restore a photo-realistic high-resolution (HR) video frame from both its corresponding low-resolution (LR) frame (reference frame) and multiple neighboring frames (supporting frames). Due to varying motion of cameras or objects, the reference frame and each support frame are not aligned. Therefore, temporal alignment is a challenging yet important problem for VSR. Previous VSR methods usually utilize optical flow between the reference frame and each supporting frame to warp the supporting frame for temporal alignment. However, both inaccurate flow and the image-level warping strategy will lead to artifacts in the warped supporting frames. To overcome the limitation, we propose a temporally-deformable alignment network (TDAN) to adaptively align the reference frame and each supporting frame at the feature level without computing optical flow. The TDAN uses features from both the reference frame and each supporting frame to dynamically predict offsets of sampling convolution kernels. By using the corresponding kernels, TDAN transforms supporting frames to align with the reference frame. To predict the HR video frame, a reconstruction network taking aligned frames and the reference frame is utilized. Experimental results demonstrate that the TDAN is capable of alleviating occlusions and artifacts for temporal alignment and the TDAN-based VSR model outperforms several recent state-of-the-art VSR networks with a comparable or even much smaller model size. The source code and pre-trained models are released in https://github.com/YapengTian/TDAN-VSR.
Diffusion models have recently advanced video restoration, but applying them to real-world video super-resolution (VSR) remains challenging due to high latency, prohibitive computation, and poor generalization to ultra-high resolutions. Our goal in this work is to make diffusion-based VSR practical by achieving efficiency, scalability, and real-time performance. To this end, we propose FlashVSR, the first diffusion-based one-step streaming framework towards real-time VSR. FlashVSR runs at approximately 17 FPS for 768x1408 videos on a single A100 GPU by combining three complementary innovations: (i) a train-friendly three-stage distillation pipeline that enables streaming super-resolution, (ii) locality-constrained sparse attention that cuts redundant computation while bridging the train-test resolution gap, and (iii) a tiny conditional decoder that accelerates reconstruction without sacrificing quality. To support large-scale training, we also construct VSR-120K, a new dataset with 120k videos and 180k images. Extensive experiments show that FlashVSR scales reliably to ultra-high resolutions and achieves state-of-the-art performance with up to 12x speedup over prior one-step diffusion VSR models. We will release the code, pretrained models, and dataset to foster future research in efficient diffusion-based VSR.
Convolutional neural networks have enabled accurate image super-resolution in real-time. However, recent attempts to benefit from temporal correlations in video super-resolution have been limited to naive or inefficient architectures. In this paper, we introduce spatio-temporal sub-pixel convolution networks that effectively exploit temporal redundancies and improve reconstruction accuracy while maintaining real-time speed. Specifically, we discuss the use of early fusion, slow fusion and 3D convolutions for the joint processing of multiple consecutive video frames. We also propose a novel joint motion compensation and video super-resolution algorithm that is orders of magnitude more efficient than competing methods, relying on a fast multi-resolution spatial transformer module that is end-to-end trainable. These contributions provide both higher accuracy and temporally more consistent videos, which we confirm qualitatively and quantitatively. Relative to single-frame models, spatio-temporal networks can either reduce the computational cost by 30% whilst maintaining the same quality or provide a 0.2dB gain for a similar computational cost. Results on publicly available datasets demonstrate that the proposed algorithms surpass current state-of-the-art performance in both accuracy and efficiency.
It is a challenging problem to reproduce rich spatial details while maintaining temporal consistency in real-world video super-resolution (Real-VSR), especially when we leverage pre-trained generative models such as stable diffusion (SD) for realistic details synthesis. Existing SD-based Real-VSR methods often compromise spatial details for temporal coherence, resulting in suboptimal visual quality. We argue that the key lies in how to effectively extract the degradation-robust temporal consistency priors from the low-quality (LQ) input video and enhance the video details while maintaining the extracted consistency priors. To achieve this, we propose a Dual LoRA Learning (DLoRAL) paradigm to train an effective SD-based one-step diffusion model, achieving realistic frame details and temporal consistency simultaneously. Specifically, we introduce a Cross-Frame Retrieval (CFR) module to aggregate complementary information across frames, and train a Consistency-LoRA (C-LoRA) to learn robust temporal representations from degraded inputs. After consistency learning, we fix the CFR and C-LoRA modules and train a Detail-LoRA (D-LoRA) to enhance spatial details while aligning with the temporal space defined by C-LoRA to keep temporal coherence. The two phases alternate iteratively for optimization, collaboratively delivering consistent and detail-rich outputs. During inference, the two LoRA branches are merged into the SD model, allowing efficient and high-quality video restoration in a single diffusion step. Experiments show that DLoRAL achieves strong performance in both accuracy and speed. Code and models are available at https://github.com/yjsunnn/DLoRAL.
Video super-resolution (VSR) has become even more important recently to provide high resolution (HR) contents for ultra high definition displays. While many deep learning based VSR methods have been proposed, most of them rely heavily on the accuracy of motion estimation and compensation. We introduce a fundamentally different framework for VSR in this paper. We propose a novel end-to-end deep neural network that generates dynamic upsampling filters and a residual image, which are computed depending on the local spatio-temporal neighborhood of each pixel to avoid explicit motion compensation. With our approach, an HR image is reconstructed directly from the input image using the dynamic upsampling filters, and the fine details are added through the computed residual. Our network with the help of a new data augmentation technique can generate much sharper HR videos with temporal consistency, compared with the previous methods. We also provide analysis of our network through extensive experiments to show how the network deals with motions implicitly.
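The dynamic filtering idea can be sketched with per-pixel predicted kernels: a small network outputs a k×k filter for every location and each pixel's neighborhood is filtered with its own kernel. The upsampling arrangement of the original dynamic upsampling filters is omitted here; the filter size and prediction head are illustrative assumptions.

```python
# Minimal sketch of dynamic per-pixel filtering: predict a separate k x k filter for every
# location and filter each pixel's neighborhood with its own kernel (no motion compensation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterLayer(nn.Module):
    def __init__(self, channels=3, ksize=5):
        super().__init__()
        self.ksize = ksize
        # Predict ksize*ksize filter taps per pixel from the input itself.
        self.filter_net = nn.Conv2d(channels, ksize * ksize, 3, padding=1)

    def forward(self, x):                                      # (B, C, H, W)
        b, c, h, w = x.shape
        filters = F.softmax(self.filter_net(x), dim=1)          # (B, k*k, H, W), taps sum to 1
        patches = F.unfold(x, self.ksize, padding=self.ksize // 2)     # (B, C*k*k, H*W)
        patches = patches.view(b, c, self.ksize * self.ksize, h, w)
        out = (patches * filters.unsqueeze(1)).sum(dim=2)       # per-pixel filtering
        return out                                              # (B, C, H, W)

x = torch.rand(1, 3, 32, 32)
print(DynamicFilterLayer()(x).shape)                            # torch.Size([1, 3, 32, 32])
```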
This paper presents the AIM 2025 Challenge on Robust Offline Video Super-Resolution, the first challenge focusing on 4× upscaling of heavily degraded 270p videos to high-quality 1080p sequences. The challenge addresses the practical problem of enhancing low-quality video content while suppressing noise, blur, and compression artifacts under realistic hardware constraints. We introduce a comprehensive benchmark consisting of 30 diverse video clips spanning camera-shot and animated content, along with a novel synthetic degradation pipeline that ensures reproducible results. Our evaluation methodology employs subjective pairwise comparisons conducted through crowdsourcing. The challenge attracted significant participation and established new baselines for robust video super-resolution in challenging real-world scenarios.
Continuous space-time video super-resolution (C-STVSR) endeavors to upscale videos simultaneously at arbitrary spatial and temporal scales, which has recently garnered increasing interest. However, prevailing methods struggle to yield satisfactory videos at out-of-distribution spatial and temporal scales. On the other hand, event streams characterized by high temporal resolution and high dynamic range, exhibit compelling promise in vision tasks. This paper presents EvEnhancer, an innovative approach that marries the unique advantages of event streams to elevate effectiveness, efficiency, and generalizability for C-STVSR. Our approach hinges on two pivotal components: 1) Event-adapted synthesis capitalizes on the spatiotemporal correlations between frames and events to discern and learn long-term motion trajectories, enabling the adaptive interpolation and fusion of informative spatiotemporal features; 2) Local implicit video transformer integrates local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations utilized to generate plausible videos at arbitrary resolutions and frame rates. Experiments show that EvEnhancer achieves superiority on synthetic and real-world datasets and preferable generalizability on out-of-distribution scales against state-of-the-art methods. Code is available at https://github.com/W-Shuoyan/EvEnhancer.
In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is 2.59 dB more accurate and 7.28× faster than the recent best BVSR baseline FMA-Net.
Diffusion models have shown great potential in generating realistic image detail. However, adapting these models to video super-resolution (VSR) remains challenging due to their inherent stochasticity and lack of temporal modeling. Previous methods have attempted to mitigate this issue by incorporating motion information and temporal layers. However, unreliable motion estimation from low-resolution videos and costly multiple sampling steps with deep temporal layers limit them to short sequences. In this paper, we propose UltraVSR, a novel framework that enables ultra-realistic and temporally-coherent VSR through an efficient one-step diffusion space. A central component of UltraVSR is the Degradation-aware Reconstruction Scheduling (DRS), which estimates a degradation factor from the low-resolution input and transforms the iterative denoising process into a single-step reconstruction from low-resolution to high-resolution videos. To ensure temporal consistency, we propose a lightweight Recurrent Temporal Shift (RTS) module, including an RTS-convolution unit and an RTS-attention unit. By partially shifting feature components along the temporal dimension, it enables effective propagation, fusion, and alignment across frames without explicit temporal layers. The RTS module is integrated into a pretrained text-to-image diffusion model and is further enhanced through Spatio-temporal Joint Distillation (SJD), which improves temporally coherence while preserving realistic details. Additionally, we introduce a Temporally Asynchronous Inference (TAI) strategy to capture long-range temporal dependencies under limited memory constraints. Extensive experiments show that UltraVSR achieves state-of-the-art performance, both qualitatively and quantitatively, in a single sampling step. Code is available at https://github.com/yongliuy/UltraVSR.
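The recurrent temporal shift described above is related to the generic temporal-shift idea sketched below: slices of the channel dimension are shifted one step forward or backward in time, so later 2D operations mix information from neighboring frames. The split ratio and usage are illustrative, not the cited module.

```python
# Minimal sketch of a temporal shift operation: shift a fraction of channels one step
# forward in time and another fraction one step backward; the rest stay in place.
import torch

def temporal_shift(feats, fold_div=8):
    # feats: (B, T, C, H, W)
    b, t, c, h, w = feats.shape
    fold = c // fold_div
    out = torch.zeros_like(feats)
    out[:, 1:, :fold] = feats[:, :-1, :fold]                         # shift forward in time
    out[:, :-1, fold:2 * fold] = feats[:, 1:, fold:2 * fold]         # shift backward in time
    out[:, :, 2 * fold:] = feats[:, :, 2 * fold:]                    # remaining channels unchanged
    return out

feats = torch.rand(2, 5, 16, 8, 8)
shifted = temporal_shift(feats)
print(shifted.shape, torch.equal(shifted[:, :, 4:], feats[:, :, 4:]))  # unchanged channels intact
```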
We proposed a novel architecture for the problem of video super-resolution. We integrate spatial and temporal contexts from continuous video frames using a recurrent encoder-decoder module, that fuses multi-frame information with the more traditional, single frame super-resolution path for the target frame. In contrast to most prior work where frames are pooled together by stacking or warping, our model, the Recurrent Back-Projection Network (RBPN) treats each context frame as a separate source of information. These sources are combined in an iterative refinement framework inspired by the idea of back-projection in multiple-image super-resolution. This is aided by explicitly representing estimated inter-frame motion with respect to the target, rather than explicitly aligning frames. We propose a new video super-resolution benchmark, allowing evaluation at a larger scale and considering videos in different motion regimes. Experimental results demonstrate that our RBPN is superior to existing methods on several datasets.
We present a joint learning scheme of video super-resolution and deblurring, called VSRDB, to restore clean high-resolution (HR) videos from blurry low-resolution (LR) ones. This joint restoration problem has drawn much less attention compared to single restoration problems. In this paper, we propose a novel flow-guided dynamic filtering (FGDF) and iterative feature refinement with multi-attention (FRMA), which constitutes our VSRDB framework, denoted as FMA-Net. Specifically, our proposed FGDF enables precise estimation of both spatiotemporally-variant degradation and restoration kernels that are aware of motion trajectories through sophisticated motion representation learning. Compared to conventional dynamic filtering, the FGDF enables the FMA-Net to effectively handle large motions into the VSRDB. Additionally, the stacked FRMA blocks trained with our novel temporal anchor (TA) loss, which temporally anchors and sharpens features, refine features in a coarse-to-fine manner through iterative updates. Extensive experiments demonstrate the superiority of the proposed FMA-Net over state-of-the-art methods in terms of both quantitative and qualitative quality. Codes and pretrained models are available at: https://kaist-viclab.github.io/fmanetsite.
Text-based diffusion models have exhibited remarkable success in generation and editing, showing great promise for enhancing visual content with their generative prior. However, applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency, which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video, a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, without training, a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm, our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation, enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks, as well as in AI-generated videos, showcasing impressive visual realism and temporal consistency.
In real-world video super-resolution (VSR), videos suffer from in-the-wild degradations and artifacts. VSR methods, especially recurrent ones, tend to propagate artifacts over time in the real-world setting and are more vulnerable than image super-resolution. This paper investigates the influence of artifacts on commonly used covariance-based attention mechanisms in VSR. Comparing the widely-used spatial attention, which computes covariance over space, versus the channel attention, we observe that the latter is less sensitive to artifacts. However, channel attention leads to feature redundancy, as evidenced by the higher covariance among output channels. As such, we explore simple techniques such as the squeeze-excite mechanism and covariance-based rescaling to counter the effects of high channel covariance. Based on our findings, we propose RealViformer. This channel-attention-based real-world VSR framework surpasses state-of-the-art on two real-world VSR datasets with fewer parameters and faster runtimes. The source code is available at https://github.com/Yuehan717/RealViformer.
The diversity and complexity of degradations in real-world video super-resolution (VSR) pose non-trivial challenges in inference and training. First, while long-term propagation leads to improved performance in cases of mild degradations, severe in-the-wild degradations could be exaggerated through propagation, impairing output quality. To balance the tradeoff between detail synthesis and artifact suppression, we found an image precleaning stage indispensable to reduce noises and artifacts prior to propagation. Equipped with a carefully designed cleaning module, our RealBasicVSR outperforms existing methods in both quality and efficiency (Fig. 1). Second, real-world VSR models are often trained with diverse degradations to improve generalizability, requiring increased batch size to produce a stable gradient. Inevitably, the increased computational burden results in various problems, including 1) speed-performance tradeoff and 2) batch-length tradeoff. To alleviate the first tradeoff, we propose a stochastic degradation scheme that reduces up to 40% of training time without sacrificing performance. We then analyze different training settings and suggest that employing longer sequences rather than larger batches during training allows more effective uses of temporal information, leading to more stable performance during inference. To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences containing rich textures and patterns. Our dataset can serve as a common ground for benchmarking. Code, models, and the dataset are publicly available at https://github.com/ckkelvinchan/RealBasicVSR.
Diffusion models are just at a tipping point for image super-resolution task. Nevertheless, it is not trivial to capitalize on diffusion models for video super-resolution which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos, but also the temporal consistency across video frames. In this paper, we propose a novel approach, pursuing Spatial Adaptation and Temporal Coherence (SATeCo), for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically, SATeCo freezes all the parameters of the pre-trained UNet and VAE, and only optimizes two deliberately-designed spatial feature adaptation (SFA) and temporal feature alignment (TFA) modules, in the decoder of UNet and VAE. SFA modulates frame features via adaptively estimating affine parameters for each pixel, guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention, and executes cross-attention between tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach.
Recently, Vision Transformer has achieved great success in recovering missing details in low-resolution sequences, i.e., the video super-resolution (VSR) task. Despite its superiority in VSR accuracy, the heavy computational burden as well as the large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper, we address the above issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter-frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely, we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information. In addition, an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves the memory and computation efficiency over state-of-the-art methods, without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR.
Video super-resolution (VSR) models achieve temporal consistency but often produce blurrier results than their image-based counterparts due to limited generative capacity. This prompts the question: can we adapt a generative image upsampler for VSR while preserving temporal consistency? We introduce VideoGigaGAN, a new generative VSR model that combines high-frequency detail with temporal stability, building on the large-scale GigaGAN image upsampler. Simple adaptations of GigaGAN for VSR led to flickering issues, so we propose techniques to enhance temporal consistency. We validate the effectiveness of VideoGigaGAN by comparing it with state-of-the-art VSR models on public datasets and showcasing video results with 8× upsampling.
Previous CNN-based video super-resolution approaches need to align multiple frames to the reference. In this paper, we show that proper frame alignment and motion compensation are crucial for achieving high-quality results. We accordingly propose a “sub-pixel motion compensation” (SPMC) layer in a CNN framework. Analysis and experiments show the suitability of this layer in video SR. The final end-to-end, scalable CNN framework effectively incorporates the SPMC layer and fuses multiple frames to reveal image details. Our implementation can generate visually and quantitatively high-quality results, superior to the current state of the art, without the need for parameter tuning.
Recent advances in video super-resolution have shown that convolutional neural networks combined with motion compensation are able to merge information from multiple low-resolution (LR) frames to generate high-quality images. Current state-of-the-art methods process a batch of LR frames to generate a single high-resolution (HR) frame and run this scheme in a sliding window fashion over the entire video, effectively treating the problem as a large number of separate multi-frame super-resolution tasks. This approach has two main weaknesses: 1) Each input frame is processed and warped multiple times, increasing the computational cost, and 2) each output frame is estimated independently conditioned on the input frames, limiting the system's ability to produce temporally consistent results. In this work, we propose an end-to-end trainable frame-recurrent video super-resolution framework that uses the previously inferred HR estimate to super-resolve the subsequent frame. This naturally encourages temporally consistent results and reduces the computational cost by warping only one image in each step. Furthermore, due to its recurrent nature, the proposed method has the ability to assimilate a large number of previous frames without increased computational demands. Extensive evaluations and comparisons with previous methods validate the strengths of our approach and demonstrate that the proposed framework is able to significantly outperform the current state of the art.
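The recurrent step described above can be sketched in PyTorch as follows: only the previous HR estimate is warped with the upscaled flow, then packed back to LR resolution with space-to-depth and concatenated with the current LR frame. The `sr_net` placeholder, the warping helper, and the flow convention are assumptions for illustration rather than the paper's exact architecture.

```python
import torch
import torch.nn.functional as F

def flow_warp(x, flow):
    """Backward-warp x (B,C,H,W) with per-pixel flow (B,2,H,W) given in pixels."""
    b, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device, dtype=x.dtype),
                            torch.arange(w, device=x.device, dtype=x.dtype), indexing="ij")
    grid = torch.stack(((xs + flow[:, 0]) / max(w - 1, 1) * 2 - 1,
                        (ys + flow[:, 1]) / max(h - 1, 1) * 2 - 1), dim=-1)
    return F.grid_sample(x, grid, align_corners=True)

def frame_recurrent_step(lr_t, hr_prev, lr_flow, sr_net, scale=4):
    """One frame-recurrent step: warp the previous HR estimate with the (upscaled) flow,
    space-to-depth it to LR resolution, concatenate with the current LR frame, and predict
    the next HR frame. lr_flow is the backward flow from the current to the previous LR frame."""
    hr_flow = F.interpolate(lr_flow, scale_factor=scale, mode="bilinear", align_corners=False) * scale
    hr_warped = flow_warp(hr_prev, hr_flow)
    packed = F.pixel_unshuffle(hr_warped, scale)          # (B, C*scale^2, h, w)
    return sr_net(torch.cat([lr_t, packed], dim=1))       # next HR estimate
```

Here `sr_net` is assumed to be any CNN that maps the concatenated LR-resolution tensor to an HR frame, e.g. with a pixel-shuffle head.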
Deep learning-based video super-resolution (VSR) networks have gained significant performance improvements in recent years. However, existing VSR networks can only support a fixed integer scale super-resolution task, and when we want to perform VSR at multiple scales, we need to train several models. This implementation certainly increases the consumption of computational and storage resources, which limits the application scenarios of VSR techniques. In this paper, we propose a novel Scale-adaptive Arbitrary-scale Video Super-Resolution network (SAVSR), which is the first work focusing on spatial VSR at arbitrary scales including both non-integer and asymmetric scales. We also present an omni-dimensional scale-attention convolution, which dynamically adapts according to the scale of the input to extract inter-frame features with stronger representational power. Moreover, the proposed spatio-temporal adaptive arbitrary-scale upsampling performs VSR tasks using both temporal features and scale information. And we design an iterative bi-directional architecture for implicit feature alignment. Experiments at various scales on the benchmark datasets show that the proposed SAVSR outperforms state-of-the-art (SOTA) methods at non-integer and asymmetric scales. The source code is available at https://github.com/Weepingchestnut/SAVSR.
Optical-flow-based and kernel-based approaches have been extensively explored for temporal compensation in satellite Video Super-Resolution (VSR). However, these techniques are less generalized in large-scale or complex scenarios, especially in satellite videos. In this paper, we propose to exploit the well-defined temporal difference for efficient and effective temporal compensation. To fully utilize the local and global temporal information within frames, we systematically modeled the short-term and long-term temporal discrepancies since we observe that these discrepancies offer distinct and mutually complementary properties. Specifically, we devise a Short-term Temporal Difference Module (S-TDM) to extract local motion representations from RGB difference maps between adjacent frames, which yields more clues for accurate texture representation. To explore the global dependency in the entire frame sequence, a Long-term Temporal Difference Module (L-TDM) is proposed, where the differences between forward and backward segments are incorporated and activated to guide the modulation of the temporal feature, leading to a holistic global compensation. Moreover, we further propose a Difference Compensation Unit (DCU) to enrich the interaction between the spatial distribution of the target frame and temporal compensated results, which helps maintain spatial consistency while refining the features to avoid misalignment. Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches. Code will be available at https://github.com/XY-boy/LGTD.
Integrating Stereo Imaging technology into medical diagnostics and surgeries marks a significant revolution in medical sciences. This advancement gives surgeons and physicians a deeper understanding of patients’ organ anatomy. However, like any technology, stereo cameras have their limitations, such as low resolution (LR) and output images that are often blurry. Our paper introduces a novel approach—a multi-stage network with a pioneering Stereo Endoscopic Attention Module (SEAM). This network aims to progressively enhance the quality of super-resolution (SR), moving from coarse to fine details. Specifically, we propose an edge-guided stereo attention mechanism integrated into each interaction of stereo features. This mechanism aims to capture consistent structural details across different views more effectively. Our proposed model demonstrates superior super-resolution reconstruction performance through comprehensive quantitative evaluations and experiments conducted on three datasets. Our E-SEVSR framework demonstrates superiority over alternative approaches. This framework leverages the edge-guided stereo attention mechanism within the multi-stage network, improving super-resolution quality in medical imaging applications.
In this paper, we explore the space-time video super-resolution task, which aims to generate a high-resolution (HR) slow-motion video from a low frame rate (LFR), low-resolution (LR) video. A simple solution is to split it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR). However, temporal interpolation and spatial super-resolution are intra-related in this task. Two-stage methods cannot fully take advantage of the natural property. In addition, state-of-the-art VFI or VSR networks require a large frame-synthesis or reconstruction module for predicting high-quality video frames, which makes the two-stage methods have large model sizes and thus be time-consuming. To overcome the problems, we propose a one-stage space-time video super-resolution framework, which directly synthesizes an HR slow-motion video from an LFR, LR video. Rather than synthesizing missing LR video frames as VFI networks do, we firstly temporally interpolate LR frame features in missing LR video frames capturing local temporal contexts by the proposed feature temporal interpolation network. Then, we propose a deformable ConvLSTM to align and aggregate temporal information simultaneously for better leveraging global temporal contexts. Finally, a deep reconstruction network is adopted to predict HR slow-motion video frames. Extensive experiments on benchmark datasets demonstrate that the proposed method not only achieves better quantitative and qualitative performance but also is more than three times faster than recent two-stage state-of-the-art methods, e.g., DAIN+EDVR and DAIN+RBPN.
Video super-resolution techniques aim to obtain high-resolution equivalents of existing low-resolution videos through a series of operations. In recent research, transformers have been increasingly popular because of their remarkable abilities in parallel computing and efficient extraction of space-time sequence features from videos. Moreover, combining self-attention and multi-scale methods has yielded excellent results. However, the combination of the two methods has limitations: current up-sampling methods struggle to match the global modeling capacity of self-attention mechanisms. Therefore, this paper proposes three strategies to combine the two methods. Based on an approximation strategy, we first construct a new bilinear up-sampling method for multi-scale acquisition. Convolution and cross-attention techniques are then used to correct and align features at different scales to prevent large deviations in feature extraction at a specific scale, which can affect subsequent feature extraction. Finally, to effectively address the computational complexity, $C^0$ continuity, and neuron death problems common to existing activation functions, a new method of constructing the activation function is proposed: a cubic spline function is used to construct a new activation function approximating tanh. The new activation function is $C^2$ continuous and is piecewise defined by cubic polynomial curves. In this study, better results were achieved on three public video super-resolution test sets: REDS4, Vid4, and Vimeo-90K-T. Experiments demonstrated that the proposed method could provide a new solution for video super-resolution tasks.
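For the activation construction, a generic version of the idea (a piecewise-cubic, $C^2$-continuous approximation of tanh built from a natural cubic spline) can be written in a few lines. The knot placement below is an assumption, not the paper's coefficients.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# A natural cubic spline through samples of tanh is piecewise cubic and C^2-continuous
# inside the knot range; this is the general construction, not the paper's exact activation.
knots = np.linspace(-3.0, 3.0, 13)
spline_tanh = CubicSpline(knots, np.tanh(knots), bc_type="natural")

x = np.linspace(-3.0, 3.0, 1001)
print("max |spline - tanh| on [-3, 3]:", np.abs(spline_tanh(x) - np.tanh(x)).max())
```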
Video super-resolution (VSR) is used to compose high-resolution (HR) video from low-resolution video. Recently, the deformable alignment-based VSR methods are becoming increasingly popular. In these methods, the features extracted from video are aligned to eliminate the motion error targeting high super-resolution (SR) quality. However, these methods often suffer from misalignment and the lack of enough temporal information to compose HR frames, which accordingly induce artifacts in the SR result. In this article, we design a deep VSR network (DVSRNet) based on the proposed progressive deformable alignment (PDA) module and temporal-sparse enhancement (TSE) module. Specifically, the PDA module is designed to accurately align features and to eliminate artifacts via the bidirectional information propagation. The TSE module is constructed to further eliminate artifacts and to generate clear details for the HR frame. In addition, we construct a lightweight deep optical flow network (OFNet) to obtain the bidirectional optical flows for the implementation of the PDA module. Moreover, two new loss functions are designed for our proposed method. The first one is adopted in OFNet and the second one is constructed to guarantee the generation of sharp and clear details for the HR frames. The experimental results demonstrate that our method performs better than the state-of-the-art methods.
Event cameras are novel bio-inspired cameras that record asynchronous events with high temporal resolution and dynamic range. Leveraging the auxiliary temporal information recorded by event cameras holds great promise for the task of video super-resolution (VSR). However, existing event-guided VSR methods assume that the event and RGB cameras are strictly calibrated (e.g., pixel-level sensor designs in DAVIS 240/346). This assumption proves limiting in emerging high-resolution devices, such as dual-lens smartphones and unmanned aerial vehicles, where such precise calibration is typically unavailable. To unlock more event-guided application scenarios, we perform the task of asymmetric event-guided VSR for the first time, and we propose an Asymmetric Event-guided VSR Network (AsEVSRN) for this new task. AsEVSRN incorporates two specialized designs for leveraging the asymmetric event stream in VSR. Firstly, the content hallucination module dynamically enhances event and RGB information by exploiting their complementary nature, thereby adaptively boosting representational capacity. Secondly, the event-enhanced bidirectional recurrent cells align and propagate temporal features fused with features from content-hallucinated frames. Within the bidirectional recurrent cells, event-enhanced flow is employed to simultaneously utilize and fuse temporal information at both the feature and pixel levels. Comprehensive experimental results affirm that our method consistently generates superior quantitative and qualitative results.
Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, there are grand challenges to effectively utilize temporal dependency in entire video sequences. Existing approaches usually align and aggregate video frames from limited adjacent frames (e.g., 5 or 7 frames), which prevents these approaches from satisfactory results. In this paper, we take one step further to enable effective spatio-temporal learning in videos. We propose a novel Trajectory-aware Transformer for Video Super-Resolution (TTVSR). In particular, we formulate video frames into several pre-aligned trajectories which consist of continuous visual tokens. For a query token, self-attention is only learned on relevant visual tokens along spatio-temporal trajectories. Compared with vanilla vision Transformers, such a design significantly reduces the computational cost and enables Transformers to model long-range features. We further propose a cross-scale feature tokenization module to overcome scale-changing problems that often occur in long-range videos. Experimental results demonstrate the superiority of the proposed TTVSR over state-of-the-art models, by extensive quantitative and qualitative evaluations in four widely-used video super-resolution benchmarks. Both code and pre-trained models can be downloaded at https://github.com/researchmm/TTVSR.
The alignment of adjacent frames is considered an essential operation in video super-resolution (VSR). Advanced VSR models, including the latest VSR Transformers, are generally equipped with well-designed alignment modules. However, the progress of the self-attention mechanism may violate this common sense. In this paper, we rethink the role of alignment in VSR Transformers and make several counter-intuitive observations. Our experiments show that: (i) VSR Transformers can directly utilize multi-frame information from unaligned videos, and (ii) existing alignment methods are sometimes harmful to VSR Transformers. These observations indicate that we can further improve the performance of VSR Transformers simply by removing the alignment module and adopting a larger attention window. Nevertheless, such designs will dramatically increase the computational burden, and cannot deal with large motions. Therefore, we propose a new and efficient alignment method called patch alignment, which aligns image patches instead of pixels. VSR Transformers equipped with patch alignment could demonstrate state-of-the-art performance on multiple benchmarks. Our work provides valuable insights on how multi-frame information is used in VSR and how to select alignment methods for different networks/datasets. Codes and models will be released at https://github.com/XPixelGroup/RethinkVSRAlignment.
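The patch-alignment idea can be sketched as follows: instead of warping every pixel with its own flow vector, each patch is moved rigidly by one (averaged) motion vector, so its content stays intact for the subsequent window attention. The pooling/broadcast scheme and bilinear resampling below are simplifying assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def patch_align(frame, flow, patch=8):
    """Align a neighbouring frame to the reference by moving whole patches instead of pixels.
    Assumes H and W are divisible by the patch size."""
    b, c, h, w = frame.shape
    # one motion vector per patch: average the flow inside each patch, then broadcast it back
    patch_flow = F.avg_pool2d(flow, patch)                          # (B, 2, H/p, W/p)
    patch_flow = F.interpolate(patch_flow, scale_factor=patch, mode="nearest")
    # backward warping with the resulting piecewise-constant flow field
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device, dtype=frame.dtype),
                            torch.arange(w, device=frame.device, dtype=frame.dtype), indexing="ij")
    grid = torch.stack(((xs + patch_flow[:, 0]) / max(w - 1, 1) * 2 - 1,
                        (ys + patch_flow[:, 1]) / max(h - 1, 1) * 2 - 1), dim=-1)
    return F.grid_sample(frame, grid, align_corners=True, padding_mode="border")
```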
Recently, due to higher requirements for satellite video resolution, video super-resolution (VSR) has been extensively studied. However, the following problems have not been effectively resolved: 1) previous satellite VSR methods cannot achieve continuous-scale (integer and noninteger scale) VSR with a single model; 2) satellite video has complex ground and weak textures, which increases the difficulty of capturing motion information. In addition, existing methods adopt a unified alignment path, which leads to a drop in feature alignment accuracy; and 3) during feature fusion, previous methods ignore the correlation of spatiotemporal information in satellite video and cannot make full use of the spatiotemporal information. To address the above problems, in this article, we propose a novel network for continuous-scale satellite VSR (CSVSR). Specifically, first, for effective motion capture and accurate feature alignment, we design a residual-guided and time-aware dynamic routing alignment module, which can use feature residuals to lock motion areas and then dynamically select the corresponding alignment path based on the temporal distance. Second, we propose a nonlocal mask-based feature fusion module (NMFFM) to exploit the correlation of the spatiotemporal features and complete effective spatiotemporal feature fusion. Third, to make our network adapt to multitask learning, we develop a scale-aware convolutional (SA-Conv) layer, which lets our network dynamically extract scale-adaptive features according to the input scale factors. Finally, we propose a continuous-scale upsampling module with a global feature implicit function (GFIF), which can achieve continuous-scale mapping from features to pixel values. In addition, we carefully design a novel training strategy to optimize our network. Comprehensive experiments verify that the proposed CSVSR has superior reconstruction performance on continuous-scale factors. The code will be available at https://github.com/chongningni/CSVSR.
Intelligent processing and analysis of satellite video has become one of the research hotspots in remote sensing, and satellite video super-resolution (SVSR) is an important research direction that can improve the image quality of satellite video. However, existing approaches for SVSR often underutilize a notable advantage inherent to satellite video: the presence of extensive sequential imagery capturing a consistent scene. Presently, the majority of SVSR methods merely harness a limited number of adjacent frames for enhancing the resolution of individual frames, thus resulting in suboptimal information utilization. In response, we introduce the recurrent aggregation network for satellite video super-resolution (RASVSR). This framework leverages a bidirectional recurrent neural network to propagate extracted features from each frame across the entire video sequence. It relies on an alignment method based on optical flow and deformable convolution (DCN) to align the features, and on a temporal feature fusion module to fuse features effectively over time. Notably, our research underscores the positive influence of employing lengthier image sequences in SVSR. In RASVSR, with better alignment and fusion, the receptive field of each frame spans 100 frames of the video, acquiring richer information and allowing information from different frames to complement each other. This strategic approach culminates in superior performance compared with alternative methods, as evidenced by a noteworthy 1.15 dB improvement in PSNR, with very few parameters.
Video super-resolution (VSR) is an important area in computer vision, aimed at reconstructing high-resolution video frames from low-resolution inputs. The task is especially challenging due to the presence of noise, motion blur, and the need to maintain temporal consistency. In this work, we present a comparative analysis of deep learning-based VSR techniques, with a particular focus on our proposed method, FRVSRGAN, a hybrid model that integrates Frame-Recurrent Video Super-Resolution (FRVSR) and Super-Resolution Generative Adversarial Networks (SRGAN). The model improves perceptual fidelity by incorporating optical flow and super-resolution networks together, thereby producing high-resolution outputs that are visually sharper and more realistic. Our evaluation focuses on infrared video sequences, which are inherently more difficult to process due to limited resolution, higher noise levels, and the scarcity of proper training data. We employ both structural and perceptual no-reference metrics, such as PIQE and BRISQUE, to measure the effectiveness of the proposed model.
To address the problems in existing video super-resolution methods, such as noise, over-smoothing, and visual artifacts, which are caused by the reliance on limited external training data or the mismatch of internal similar patch instances, this study proposes a novel video super-resolution reconstruction algorithm based on deep learning and spatio-temporal feature similarity (DLSS-VSR). A video super-resolution reconstruction mechanism with joint internal and external constraints is established by utilizing the complementary advantages of external deep correlation mapping learning and an internal spatio-temporal nonlocal self-similarity prior constraint. A deep learning model based on a deep convolutional neural network is constructed to learn the nonlinear correlation mapping between low-resolution and high-resolution video frame patches. A novel spatio-temporal feature similarity calculation method is proposed, which considers both internal video spatio-temporal self-similarity and external clean nonlocal similarity. For the internal spatio-temporal feature self-similarity, we improve the accuracy and robustness of similarity matching by proposing a similarity measure strategy based on spatio-temporal moment feature similarity and structural similarity. The external nonlocal similarity prior constraint is learned by a patch-group-based Gaussian mixture model. The time efficiency of spatio-temporal similarity matching is further improved based on a saliency detection and region correlation judgment strategy, which achieves a better tradeoff between super-resolution accuracy and speed. Experimental results demonstrate that the DLSS-VSR algorithm achieves competitive super-resolution quality compared to other state-of-the-art algorithms in both subjective and objective evaluations.
Video super-resolution (VSR) technology excels in reconstructing low-quality video, avoiding the unpleasant blur effects caused by interpolation-based algorithms. However, vast computational complexity and memory occupation hamper edge deployability and runtime inference in real-life applications, especially for large-scale VSR tasks. This paper explores the possibility of a real-time VSR system and designs an efficient and generic VSR network, termed EGVSR. The proposed EGVSR is based on spatio-temporal adversarial learning for temporal coherence. To pursue faster VSR processing up to 4K resolution, we choose a lightweight network structure and an efficient upsampling method to reduce the computation required by the EGVSR network while guaranteeing high visual quality. Besides, we implement batch-normalization computation fusion, convolutional acceleration algorithms, and other neural-network acceleration techniques on the actual hardware platform to optimize the inference process of the EGVSR network. Finally, our EGVSR achieves real-time processing capacity of 4K@29.61 FPS. Compared with TecoGAN, the most advanced VSR network at present, we achieve an 85.04% reduction in computation density and a 7.92× speedup. In terms of visual quality, the proposed EGVSR tops most metrics (such as LPIPS, tOF, and tLP) on the public test dataset Vid4 and surpasses other state-of-the-art methods in overall performance score.
Most of the existing works in supervised spatio-temporal video super-resolution (STVSR) heavily rely on a large-scale external dataset consisting of paired low-resolution low-frame-rate (LR-LFR) and high-resolution high-frame-rate (HR-HFR) videos. Despite their remarkable performance, these methods make a prior assumption that the low-resolution video is obtained by down-scaling the high-resolution video using a known degradation kernel, which does not hold in practical settings. Another problem with these methods is that they cannot exploit instance-specific internal information of a video at testing time. Recently, deep internal learning approaches have gained attention due to their ability to utilize the instance-specific statistics of a video. However, these methods have a large inference time as they require thousands of gradient updates to learn the intrinsic structure of the data. In this work, we present Adaptive Video Super-Resolution (Ada-VSR) which leverages external, as well as internal, information through meta-transfer learning and internal learning, respectively. Specifically, meta-learning is employed to obtain adaptive parameters, using a large-scale external dataset, that can adapt quickly to the novel condition (degradation model) of the given test video during the internal learning task, thereby exploiting external and internal information of a video for super-resolution. The model trained using our approach can quickly adapt to a specific video condition with only a few gradient updates, which reduces the inference time significantly. Extensive experiments on standard datasets demonstrate that our method performs favorably against various state-of-the-art approaches.
No abstract available
Video super-resolution (VSR) aims to enhance low-resolution videos by leveraging both spatial and temporal information. While deep learning has led to impressive progress, it typically requires centralized data, which raises privacy concerns. Federated learning (FL) offers a privacy-friendly solution, but general FL frameworks often struggle with low-level vision tasks, resulting in blurry, low-quality outputs. To address this, we introduce FedVSR, the first FL framework specifically designed for VSR. It is model-agnostic and stateless, and introduces a lightweight loss function based on the Discrete Wavelet Transform (DWT) to better preserve high-frequency details during local training. Additionally, a loss-aware aggregation strategy combines both DWT-based and task-specific losses to guide global updates effectively. Extensive experiments across multiple VSR models and datasets show that FedVSR not only improves perceptual video quality (up to +0.89 dB PSNR, +0.0370 SSIM, -0.0347 LPIPS and 4.98 VMAF) but also achieves these gains with close to zero computation and communication overhead compared to its rivals. These results demonstrate FedVSR's potential to bridge the gap between privacy, efficiency, and perceptual quality, setting a new benchmark for federated learning in low-level vision tasks. The code is available at: https://github.com/alimd94/FedVSR
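The DWT-based detail loss mentioned above can be illustrated with a differentiable single-level Haar transform. The sub-band weighting and the Charbonnier base term below are assumptions; FedVSR's exact loss and aggregation rule are defined in its repository.

```python
import torch
import torch.nn.functional as F

def haar_dwt(x):
    """Single-level 2-D Haar DWT via a stride-2 grouped convolution (differentiable),
    returning the LL, LH, HL and HH sub-bands of x (B,C,H,W)."""
    c = x.shape[1]
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).to(x).unsqueeze(1)        # (4, 1, 2, 2)
    kernels = kernels.repeat(c, 1, 1, 1)                              # one filter bank per channel
    out = F.conv2d(x, kernels, stride=2, groups=c)                    # (B, 4C, H/2, W/2)
    return out.reshape(x.shape[0], c, 4, *out.shape[-2:]).unbind(2)   # LL, LH, HL, HH

def wavelet_detail_loss(sr, hr, w_high=1.0):
    """Charbonnier loss plus an extra penalty on the high-frequency sub-bands (LL skipped)."""
    def charbonnier(a, b, eps=1e-6):
        return torch.sqrt((a - b) ** 2 + eps).mean()
    loss = charbonnier(sr, hr)
    for band_sr, band_hr in zip(haar_dwt(sr)[1:], haar_dwt(hr)[1:]):
        loss = loss + w_high * charbonnier(band_sr, band_hr)
    return loss
```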
High-resolution, high-frame-rate videos can record motion scenes in detail and smoothly, but usually only professional cameras have enough transmission bandwidth to meet the video capture requirement. The conventional solutions use video processing methods such as video super-resolution (VSR) and video frame interpolation (VFI), but their results suffer from unreal spatial-temporal details in complex dynamic cases. To address this problem, we reconstruct a more realistic high-resolution, high-frame-rate video using a hybrid video input, including a low-resolution high-frame-rate video (main video) and a high-resolution low-frame-rate video (auxiliary video). We propose a deep learning model named HIS-VSR, which consists of three parts: super-resolution of the main video, detail feature extraction from the auxiliary video, and hybrid video information aggregation. Among them, the first part processes the main video to generate preliminary high-resolution frames; the second part warps the auxiliary frames for alignment and extracts their high-resolution detail features; the last part uses a weighted aggregation method to fuse the results of the first and second parts. We train our model on synthetic datasets and demonstrate its excellent performance in reconstructing dynamic scenes by comparing it with Deep-SloMo on synthetic and real videos.
Video super-resolution (VSR) and frame interpolation (FI) are traditional computer vision problems, and their performance has recently been improving by incorporating deep learning. In this paper, we investigate the problem of jointly upsampling videos both in space and time, which is becoming more important with advances in display systems. One solution for this is to run VSR and FI, one by one, independently. This is highly inefficient as heavy deep neural networks (DNNs) are involved in each solution. To this end, we propose an end-to-end DNN framework for space-time video upsampling by efficiently merging VSR and FI into a joint framework. In our framework, a novel weighting scheme is proposed to fuse input frames effectively without explicit motion compensation for efficient processing of videos. Our framework produces better results both quantitatively and qualitatively, while reducing the computation time (7× faster) and the number of parameters (by 30%) compared to baselines.
Deep learning Video Super-Resolution (VSR) methods rely on learning spatio-temporal correlations between a target frame and its neighbouring frames in a given temporal radius to generate a high-resolution output. Among recent VSR models, a sliding window mechanism is popularly adopted by picking a fixed number of consecutive frames as neighbouring frames for a given target frame. This results in a single frame being used multiple times in the input space during the super-resolution process. Moreover, the approach of adopting fixed consecutive frames directly does not allow deep learning models to learn the full extent of spatio-temporal inter-dependencies between a target frame and its neighbours along a video sequence. To mitigate these issues, this paper proposes a Spatio-Temporal Input Frame Selection (STIFS) algorithm based on image analysis to adaptively select the neighbouring frame(s) based on the spatio-temporal context dynamics with respect to the target frame. STIFS is the first dynamic selection mechanism proposed for VSR methods. It aims to enable VSR models to better learn spatio-temporal correlations in a given temporal radius and consequently maximise the quality of the high-definition output. The proposed STIFS algorithm achieved remarkable PSNR improvements in the high-resolution output for VSR models on benchmark datasets.
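A stand-in for the adaptive selection step might look like the sketch below, which scores candidate neighbours by a cheap histogram similarity. STIFS's actual image-analysis criteria are richer, so treat this only as an illustration of the selection interface.

```python
import numpy as np

def select_neighbour_frames(frames, target_idx, radius=3, k=2):
    """Pick the k neighbouring frames (within +/- radius) whose global statistics are most
    similar to the target frame, using a simple histogram-distance score.
    frames: list of float arrays with values in [0, 1]."""
    def histogram(img, bins=64):
        h, _ = np.histogram(img, bins=bins, range=(0.0, 1.0), density=True)
        return h
    target_hist = histogram(frames[target_idx])
    candidates = [i for i in range(max(0, target_idx - radius),
                                   min(len(frames), target_idx + radius + 1)) if i != target_idx]
    scores = []
    for i in candidates:
        h = histogram(frames[i])
        # negative chi-square distance: higher means a more similar context
        scores.append(-np.sum((h - target_hist) ** 2 / (h + target_hist + 1e-8)))
    order = np.argsort(scores)[::-1]
    return [candidates[j] for j in order[:k]]
```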
The video super-resolution (VSR) problem has developed rapidly along with deep learning methods. However, further progress usually requires increasingly complex architectures. Unlike these works, this paper improves VSR performance from the new perspective of sample difficulty. We propose an exclusive curriculum learning strategy for VSR, which can improve the representation power without noticeable computation increment. Specifically, we track the performance history of every sample and calculate a customized weight for each sample accordingly. In this way, the model can automatically concentrate on the easy samples first and gradually focus on the hard ones. Experimental analysis of the training process and on benchmark datasets demonstrates that our method can substantially boost the performance with a superior convergence speed and a limited number of parameters.
Video super-resolution (VSR) aims to reconstruct a sequence of high-resolution (HR) images from their corresponding low-resolution (LR) versions. Traditionally, solving a VSR problem has been based on iterative algorithms that can exploit prior knowledge on image formation and assumptions on the motion. However, these classical methods struggle at incorporating complex statistics from natural images. Furthermore, VSR has recently benefited from the improvement brought by deep learning (DL) algorithms. These techniques can efficiently learn spatial patterns from large collections of images. Yet, they fail to incorporate some knowledge about the image formation model, which limits their flexibility. Unrolled optimization algorithms, developed for solving inverse problems, allow prior information to be included in deep learning architectures. They have been used mainly for single image restoration tasks. Adapting an unrolled neural network structure brings several benefits: it may increase super-resolution performance, it gives neural networks better interpretability, and it allows a single model to be learned that non-blindly deals with multiple degradations. In this paper, we propose a new VSR neural network based on unrolled optimization techniques and discuss its performance.
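A generic unrolling template of the kind described above alternates a gradient step on the data-fidelity term with a small learned prior network. The downsampling operator, step-size parameterization, and prior CNN below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnrolledSR(nn.Module):
    """Minimal unrolled scheme for SR: each stage takes a gradient step on 0.5*||D(x) - y||^2
    (D = s-fold average-pool downsampling here) and then applies a small learned prior network,
    mimicking a proximal operator."""
    def __init__(self, stages=5, scale=4, channels=3, width=32):
        super().__init__()
        self.scale, self.stages = scale, stages
        self.step = nn.Parameter(torch.full((stages,), 0.5))          # learned per-stage step sizes
        self.prior = nn.ModuleList([
            nn.Sequential(nn.Conv2d(channels, width, 3, padding=1), nn.ReLU(inplace=True),
                          nn.Conv2d(width, channels, 3, padding=1))
            for _ in range(stages)])

    def forward(self, y):                                             # y: LR frame (B, C, h, w)
        x = F.interpolate(y, scale_factor=self.scale, mode="bicubic", align_corners=False)
        for k in range(self.stages):
            # gradient of the data term; the adjoint of average pooling spreads the residual
            # uniformly over each s x s block (hence the division by scale^2)
            residual = F.avg_pool2d(x, self.scale) - y
            grad = F.interpolate(residual, scale_factor=self.scale, mode="nearest") / self.scale ** 2
            x = x - self.step[k] * grad
            x = x + self.prior[k](x)                                  # learned prior / proximal step
        return x
```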
Generative deep learning has sparked a new wave of Super-Resolution (SR) algorithms that enhance single images with impressive aesthetic results, albeit with imaginary details. Multi-frame Super-Resolution (MFSR) offers a more grounded approach to the ill-posed problem, by conditioning on multiple low-resolution views. This is important for satellite monitoring of human impact on the planet -- from deforestation, to human rights violations -- that depend on reliable imagery. To this end, we present HighRes-net, the first deep learning approach to MFSR that learns its sub-tasks in an end-to-end fashion: (i) co-registration, (ii) fusion, (iii) up-sampling, and (iv) registration-at-the-loss. Co-registration of low-resolution views is learned implicitly through a reference-frame channel, with no explicit registration mechanism. We learn a global fusion operator that is applied recursively on an arbitrary number of low-resolution pairs. We introduce a registered loss, by learning to align the SR output to a ground-truth through ShiftNet. We show that by learning deep representations of multiple views, we can super-resolve low-resolution signals and enhance Earth Observation data at scale. Our approach recently topped the European Space Agency's MFSR competition on real-world satellite imagery.
We propose a deep reparametrization of the maximum a posteriori formulation commonly employed in multi-frame image restoration tasks. Our approach is derived by introducing a learned error metric and a latent representation of the target image, which transforms the MAP objective to a deep feature space. The deep reparametrization allows us to directly model the image formation process in the latent space, and to integrate learned image priors into the prediction. Our approach thereby leverages the advantages of deep learning, while also benefiting from the principled multi-frame fusion provided by the classical MAP formulation. We validate our approach through comprehensive experiments on burst denoising and burst super-resolution datasets. Our approach sets a new state-of-the-art for both tasks, demonstrating the generality and effectiveness of the proposed formulation.
Smartphone cameras have become ubiquitous imaging tools, yet their small sensors and compact optics often limit spatial resolution and introduce distortions. Combining information from multiple low-resolution (LR) frames to produce a high-resolution (HR) image has been explored to overcome the inherent limitations of smartphone cameras. Despite the promise of multi-frame super-resolution (MFSR), current approaches are hindered by datasets that fail to capture the characteristic noise and motion patterns found in real-world handheld burst images. In this work, we address this gap by introducing a novel synthetic data engine that uses multi-exposure static images to synthesize LR-HR training pairs while preserving sensor-specific noise characteristics and image motion found during handheld burst photography. We also propose MFSR-GAN: a multi-scale RAW-to-RGB network for MFSR. Compared to prior approaches, MFSR-GAN emphasizes a “base frame” throughout its architecture to mitigate artifacts. Experimental results on both synthetic and real data demonstrate that MFSR-GAN trained with our synthetic engine yields sharper, more realistic reconstructions than existing methods for real-world MFSR.
With the continuous advancement of space technology, the number of defunct spacecraft, abandoned rocket bodies, and debris in space is increasing. These non-cooperative objects occupy a significant amount of orbital resources and pose a substantial threat to the safety of on-orbit spacecraft. This paper focuses on close-proximity operations in space and aims to address the limitation of camera resolution by proposing an optical flow-based multi-frame super-resolution reconstruction algorithm. This algorithm employs a multi-level wavelet convolutional network (MWCNN) for feature extraction and uses SpyNet to obtain multi-level optical flow between different frames. The multi-level optical flow pyramid alignment network is used to align features, and a recurrent network is utilized for frame-by-frame feature fusion. Finally, a reconstruction network generates high-resolution images. Extensive experiments have demonstrated that our proposed method effectively enhances the perception capabilities of space non-cooperative objects.
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
This paper presents a novel application of a Multi-frame Super Resolution (MFSR) method for lunar surface imagery called Lunar HighRes-net (L-HRN). In this work, we adapted and used NASA's Lunar Reconnaissance Orbiter (LRO) image database to train the deep learning architecture for image super-resolution. Additionally, we gathered an artificial image dataset from our virtual Moon to increase the amount of input data for the neural network training process. The network's architecture follows a standard MFSR algorithm that was enhanced for this specific use case. The proposed MFSR method has been evaluated using the well-known peak signal-to-noise ratio (PSNR) metric against other state-of-the-art generic super-resolution methods. This work aims to improve environmental knowledge about the lunar surface and to enhance the capabilities of future autonomous robots on the surface of the Moon.
Satellite image super-resolution is an important task that generates high-resolution satellite images from low-resolution inputs. Multi-frame super-resolution utilizes multiple low-resolution images to generate a single high-resolution image. Multi-frame super-resolution methods face difficulty in handling spatial and temporal dependencies of pixels. In this work, we propose a novel architecture named Multi-context Dense Network (MCDNet) that handles spatial and temporal pixel dependencies using global average pooling, multi-size kernels, and self-attention. The proposed approach improves the PSNR values by 0.29% and 0.001% for super-resolution of the NIR and RED bands on the benchmark PROBA-V dataset.
The small satellite market continues to grow year after year. A compound annual growth rate of 17% is estimated during the period between 2020 and 2025. Low-cost satellites can send a vast number of images to be post-processed on the ground to improve quality and extract detailed information. In this domain lies the resolution enhancement task, where a low-resolution image is converted to a higher resolution automatically. Deep learning approaches to super-resolution (SR) have reached the state of the art on multiple benchmarks; however, most of them were studied in a single-frame fashion. With satellite imagery, multiple frames can be obtained under different conditions, making it possible to add more information per image and improve the final analysis. In this context, we developed, and applied to the PROBA-V dataset of multi-frame satellite images, a model that recently topped the European Space Agency's Multi-frame Super-Resolution (MFSR) competition. The model is based on proven methods from 2D images, tweaked to work in 3D: the Wide Activation Super-Resolution (WDSR) family. We show that with a simple 3D CNN residual architecture with WDSR blocks and a frame permutation technique as the data augmentation, better scores can be achieved than with more complex models. Moreover, the model requires few hardware resources, both for training and evaluation, so it can be applied directly on a personal laptop.
No abstract available
No abstract available
Image super-resolution reconstruction has been widely used in remote sensing, medicine, and other fields. In recent years, with the rise of deep learning research and the successful application of convolutional neural networks to images, super-resolution reconstruction technology based on deep learning has also developed greatly. However, some problems still need to be solved. For example, current mainstream single- and multi-frame image super-resolution algorithms pursue high performance indicators such as PSNR and SSIM, while the reconstructed images are relatively smooth and lack many high-frequency details, which is not conducive to application in real environments. To address this problem, this paper proposes a super-resolution reconstruction model for sequential images based on Generative Adversarial Networks (GANs). The proposed approach uses a registration module to fuse adjacent frames, effectively exploits the detailed information in multiple consecutive frames, and enhances the spatio-temporal information of the low-resolution sequential images. While the GAN is used to improve the reconstruction of high-frequency texture details, WGAN is introduced to optimize model training. The reconstruction results not only improve the PSNR and SSIM indexes but also recover more high-frequency texture details. Finally, to further improve perceptual quality, an additional registration loss term (RLT) is introduced into the GAN perceptual loss. Extensive experiments show that the proposed model effectively exploits the information shared between the sequential images: while improving the PSNR and SSIM indicators, it reconstructs better high-frequency texture details than current advanced multi-frame algorithms.
No abstract available
No abstract available
In this paper, we propose a novel multi-frame super-resolution (SR) method, which is developed by incorporating image enhancement and denoising into the SR process. For image enhancement, a gradient vector flow hybrid field (GVFHF) algorithm, which is robust to noise, is first designed to capture the image edges more accurately. Then, by replacing the gradient of the anisotropic diffusion shock filter (ADSF) with GVFHF, a GVFHF-based ADSF (GVFHF-ADSF) model is proposed, which can effectively achieve image denoising and enhancement. In addition, a difference curvature-based spatial weight factor is defined in the GVFHF-ADSF model to obtain an adaptive weight between denoising and enhancement in the flat and edge regions. Finally, a GVFHF-ADSF-based multi-frame SR method is presented by employing the GVFHF-ADSF model as a regularization term, and the steepest-descent algorithm is adopted to solve the inverse SR problem. Experimental results and comparisons with existing methods demonstrate that the proposed GVFHF-ADSF-based SR algorithm can effectively suppress both Gaussian and salt-and-pepper noise while enhancing the edges of the reconstructed image.
This letter proposes a multi-frame super-resolution framework by combining Demons registration with Bayesian-based regularized reconstruction. For both of our proposals, D-BTVIR and D-IRWIR, the visual analysis shows improvements in regions where the compared methods produced results with either motion-artifacts or over-smoothed aspect due to misregistration. Quantitative results on simulated deformations show an improvement of 3.5%, on average, in PSNR and 8.0%, in SSIM. Finally, results from the Nemenyi test show that D-IRWIR is statistically superior to the other methods we tested, both considering SSIM and PSNR, and D-BTVIR is statistically equivalent while being 7 times faster than the state-of-the-art Bayesian method.
In this study, the authors propose a novel multi-frame super-resolution method using frame selection and multiple fusions for quality enhancement of atmospherically distorted, zoomed-in images. When a small part of an image, captured with a target placed several kilometres away from a fixed camera, is enlarged, its quality becomes poor owing to low resolution, spatial deformations, and noise that are mainly caused by the long distance and atmospheric turbulence. Thus, the authors propose an adaptive frame selection method that selects only a few frames with small blur based on the corresponding images with relatively clear edges. Further, they propose multiple fusion schemes to reconstruct the selected frames, thereby suppressing the influence of deformation. By converting all the selected frames to high resolution with each frame as the reference and integrating them, the multiple-fusion scheme effectively removes deformation and noise without high computational cost. The proposed method, which enhances the quality of atmospherically distorted zoomed-in images, exhibits superior performance to state-of-the-art image super-resolution methods with regard to accuracy, efficiency, and ease of implementation, making it suitable for enhancing the quality of an image captured using a general digital camera or a smartphone.
Since the birth of convolutional neural networks, the application of deep learning technology in image processing has been booming, and deep-learning-based super-resolution is one of the fields attracting the most attention. In the traditional deep learning super-resolution process, the conversion of high-resolution images to low-resolution images is usually obtained by downsampling, but when the actual image degradation does not conform to this process, the performance of the model is usually greatly reduced. Currently, single-frame input is mainly used for image super-resolution, but this usually leads to undesirable results in large-scale reconstruction. This article builds on the SRMD network (a single convolutional super-resolution network with multiple degradations): the key factors of image degradation (blur kernel and noise level) are added to the input of the model, and the measurement matrix commonly used in compressed sensing is used to generate multi-frame images. We propose the MFSR network (Multi-Frames Input Super-Resolution Network with Multiple Degradations) and achieve excellent results on the target dataset.
In the field of computer vision, image super-resolution is a difficult task with many applications in remote sensing, the military, and other areas. In this paper, we introduce the convolutional block attention module (CBAM) into the super-resolution problem, proposing a novel multi-frame super-resolution (MFSR) algorithm based on an attention mechanism. Our proposed MFSR algorithm uses a three-layer CNN as its baseline network and cascades a CBAM at the end of each CNN block. The proposed algorithm delivers a high-resolution output corresponding to the center (3rd) input frame. The average PSNR and SSIM of our algorithm are 33.318 dB and 0.906, respectively, outperforming other MFSR algorithms.
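For reference, a standard CBAM block (channel attention from average- and max-pooled descriptors, followed by 7x7 spatial attention) looks like this in PyTorch; the reduction ratio and kernel size are the commonly used defaults, not necessarily the values used in this paper.

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """Convolutional Block Attention Module: sequential channel attention (shared MLP over
    average- and max-pooled descriptors) followed by spatial attention (7x7 convolution over
    the channel-wise average and max maps)."""
    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                                 nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # spatial attention
        maps = torch.cat([x.mean(dim=1, keepdim=True), x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(maps))
```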
It is difficult to improve image resolution in hardware due to technological limitations and high costs, yet most application fields need high-resolution images, which has motivated super-resolution technology. This paper mainly uses information redundancy to realize multi-frame super-resolution. In recent years, many researchers have proposed a variety of multi-frame super-resolution methods, but in practical applications it is very difficult to preserve image edges and texture details while effectively removing the influence of noise. In this paper, a minimum-variance method is proposed to quickly select low-resolution images of appropriate quality for super-resolution. A half-quadratic function is used as the loss function to minimize the observation error between the estimated high-resolution image and the low-resolution images. The function parameter is determined adaptively according to the observation errors of each low-resolution image. The combination of a local structure tensor and Bilateral Total Variation (BTV) as image prior knowledge preserves the details of the image and suppresses the noise simultaneously. Experimental results on synthetic and real data show that our proposed method preserves image details better than existing methods.
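The Bilateral Total Variation prior mentioned here is a classical multi-frame SR regulariser (Farsiu et al.); a direct NumPy evaluation of it is shown below. The half-quadratic data term and the structure-tensor weighting from the paper are omitted.

```python
import numpy as np

def btv_regularizer(x, p=2, alpha=0.7):
    """Bilateral Total Variation prior:
    R(x) = sum over shifts |l|,|m| <= p (excluding 0,0) of alpha^(|l|+|m|) * ||x - shift(x, l, m)||_1,
    an edge-preserving penalty on a 2-D image array x."""
    value = 0.0
    for l in range(-p, p + 1):
        for m in range(-p, p + 1):
            if l == 0 and m == 0:
                continue
            shifted = np.roll(np.roll(x, l, axis=0), m, axis=1)
            value += alpha ** (abs(l) + abs(m)) * np.abs(x - shifted).sum()
    return value
```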
No abstract available
Gaofen-4 is China’s first geosynchronous orbit high-definition optical imaging satellite with extremely high temporal resolution. The features of staring imaging and high temporal resolution enable the super-resolution of multiple images of the same scene. In this paper, we propose a super-resolution (SR) technique to reconstruct a higher-resolution image from multiple low-resolution (LR) satellite images. The method first performs image registration in both the spatial and range domains. Then the point spread function (PSF) of LR images is parameterized by a Gaussian function and estimated by a blind deconvolution algorithm based on the maximum a posteriori (MAP). Finally, the high-resolution (HR) image is reconstructed by a MAP-based SR algorithm. The MAP cost function includes a data fidelity term and a regularized term. The data fidelity term is in the L2 norm, and the regularized term employs the Huber-Markov prior which can reduce the noise and artifacts while preserving the image edges. Experiments with real Gaofen-4 images show that the reconstructed images are sharper and contain more details than Google Earth ones.
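The MAP cost described above has the generic form of an L2 data-fidelity term plus a Huber-penalised smoothness prior. The sketch below instantiates the Huber-Markov prior on first-order differences, a common choice; the PSF/warping operators are placeholders supplied by the caller, not the paper's estimated operators.

```python
import numpy as np

def huber(t, delta):
    """Huber penalty: quadratic near zero, linear in the tails (edge-preserving)."""
    a = np.abs(t)
    return np.where(a <= delta, 0.5 * t ** 2, delta * (a - 0.5 * delta))

def map_cost(hr, lr_frames, downsample, warp_ops, lam=0.01, delta=0.05):
    """MAP objective: sum_k ||D(W_k(hr)) - y_k||_2^2 + lam * Huber prior on image gradients.
    downsample and warp_ops[k] stand in for the estimated PSF/decimation and registration
    operators (callables on 2-D arrays)."""
    data = sum(np.sum((downsample(w(hr)) - y) ** 2) for w, y in zip(warp_ops, lr_frames))
    gx = np.diff(hr, axis=1)                      # horizontal first-order differences
    gy = np.diff(hr, axis=0)                      # vertical first-order differences
    prior = huber(gx, delta).sum() + huber(gy, delta).sum()
    return data + lam * prior
```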
Multi-Frame Super Resolution with Deep Residual Learning on Flow Registered Non-Integer Pixel Images
Super-Resolution (SR) of low-quality images is an important topic of research in the image processing and computer vision fields. Using multiple frames, super-resolution algorithms can reconstruct high-resolution images by incorporating information from subsequent images. Most multi-frame super-resolution techniques either use a traditional, mathematical approach or a deep-learning-based approach that takes optical flow into consideration. In this paper, we develop a way to combine an optical-flow-enabled sub-pixel registration method for mapping onto the high-resolution grid with a deep residual learning approach for restoring features with noise removal. The results exhibit a significant gain over state-of-the-art methods and bicubic interpolation.
The optical resolution of a digital camera is one of its most crucial parameters with broad relevance for consumer electronics, surveillance systems, remote sensing, or medical imaging. However, resolution is physically limited by the optics and sensor characteristics. In addition, practical and economic reasons often stipulate the use of out-dated or low-cost hardware. Super-resolution is a class of retrospective techniques that aims at high-resolution imagery by means of software. Multi-frame algorithms approach this task by fusing multiple low-resolution frames to reconstruct high-resolution images. This work covers novel super-resolution methods along with new applications in medical imaging.
Some biometric methods, especially ocular ones, may use fine spatial information akin to level-3 features. Examples include fine vascular patterns visible in the white of the eyes in green and blue channels, iridial patterns in near infrared, or minute periocular features in visible light. In some mobile applications, an NIR or RGB camera is used to capture these ocular images in a "selfie"-like manner. However, most such ocular images captured in unconstrained environments are of lower quality due to limited spatial resolution, noise, and motion blur, affecting the performance of the ensuing biometric authentication. Here we propose a multi-frame super-resolution (MFSR) pipeline to mitigate the problem, where a higher-resolution image is generated from multiple lower-resolution, noisy, and blurry images. We show that the proposed MFSR method at 2× upscaling can improve the equal error rate (EER) by 9.85% compared to single-frame bicubic upscaling in RGB ocular matching, while being up to 8.5× faster than comparable state-of-the-art MFSR methods.
Multi-frame super-resolution recovers a high-resolution (HR) image from a sequence of low-resolution (LR) images. In this paper, we propose an algorithm that performs multi-frame super-resolution in an online fashion. This algorithm processes only one low-resolution image at a time instead of co-processing all LR images which is adopted by state-of-the-art super-resolution techniques. Our algorithm is very fast and memory efficient, and simple to implement. In addition, we employ a noise-adaptive parameter in the classical steepest gradient optimization method to avoid noise amplification and overfitting LR images. Experiments with simulated and real-image sequences yield promising results.
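One online update of the kind described above (a single steepest-descent step per newly arrived LR frame, damped so the residual is not pushed below the noise floor) might look like this. The damping heuristic and the operator placeholders are assumptions for illustration, not the paper's exact noise-adaptive parameter.

```python
import numpy as np

def online_sr_update(hr, lr_frame, degrade, degrade_adjoint, step=1.0, noise_level=0.01):
    """Refine the running HR estimate using only the newly arrived LR frame.
    degrade maps an HR image to the LR grid (warp + blur + decimation); degrade_adjoint
    is its adjoint. Both are caller-supplied placeholders here."""
    residual = degrade(hr) - lr_frame                      # simulated LR minus observed LR
    # damp the step once the residual energy approaches the expected noise energy,
    # to avoid amplifying noise or overfitting this single frame (heuristic)
    damping = max(0.0, 1.0 - noise_level ** 2 * lr_frame.size / (np.sum(residual ** 2) + 1e-12))
    return hr - step * damping * degrade_adjoint(residual)
```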
With the increased use of closed-circuit television (CCTV) footage for security and surveillance purposes as well as for object or person recognition and efficiency monitoring, high-quality CCTV videos are necessary. In this paper, we propose Corgi Eye, a moving object removal + super-resolution framework for enhancing CCTV footage and removing ghosting artifacts caused by performing multi-frame super-resolution (MISR) on moving objects. Our method extends the framework of Eagle Eye, which is an existing MISR framework tailored for mobile devices. Our results demonstrate that the system can completely remove ghosting effects caused by moving objects while performing MISR on CCTV footage. Our proposed method demonstrates competitive performance when compared to Eagle Eye, achieving a 16% increase in terms of the PSNR metric. Additionally, our method can produce clear images, on par with deep learning approaches such as ESPCN and SOF-VSR.
This paper addresses the challenge of keeping up with the ever-increasing graphical complexity of video games and introduces a deep-learning approach to mitigating it. As games become more demanding in terms of graphics, it becomes increasingly difficult to maintain high image quality while also ensuring good performance. This is where deep learning super sampling (DLSS) comes in. The paper explains how DLSS works, including the use of convolutional autoencoder neural networks and various other techniques and technologies. It also covers how the network is trained and optimized, as well as how it incorporates temporal antialiasing and frame generation techniques to enhance the final image quality. We also discuss the effectiveness of these techniques and compare their performance to rendering at native resolution.
No abstract available
The rapid advancement of remote sensing technology has driven a growing demand for high-quality satellite video. However, constrained by transmission bandwidth and storage costs, satellite videos are typically stored and transmitted at low spatial resolutions and frame rates. As a key solution, space–time super-resolution (STSR) aims to enhance video quality. Unfortunately, existing satellite video STSR methods fail to effectively model interframe continuity and, consequently, only support a fixed upsampling scale, which restricts their flexibility and practical applicability. To address this challenge, we propose a novel continuous STSR (CSTSR) framework for satellite videos, termed SV-CSTSR. Specifically, first, we develop a spatial–frequency joint modulation block (SFJMB), which aims to jointly mine the fine-grained features of satellite videos in the spatial and frequency domains by using a dual-branch structure. Second, we propose a mask-based temporal-aware warping module (MTWM) to model continuous feature representations in the temporal dimension by modulating feature warping with a time factor, enabling frame interpolation at arbitrary times. Third, we design an interframe deformable attention module (IDAM), which can explore interframe dependencies to achieve effective information fusion by adaptively selecting the positions of key and value pairs in a data-dependent manner. Finally, we propose a cross-level frequency integration module (CFIM) for continuous-scale upsampling. CFIM aims to activate channel responses through frequency selection (FS), thereby achieving adaptive integration of cross-level frequency latent codes to learn feature representations of satellite videos at arbitrary resolution. Extensive experiments demonstrate that the proposed SV-CSTSR surpasses state-of-the-art (SOTA) methods, achieving superior quantitative accuracy and perceptual fidelity.
In this work, we propose a hybrid learning-based method for layered spatial scalability. Our framework consists of a base layer (BL), which encodes a spatially downsampled representation of the input video using Versatile Video Coding (VVC), and a learning-based enhancement layer (EL), which conditionally encodes the original video signal. The EL is conditioned by two fused prediction signals: a spatial inter-layer prediction signal, that is generated by spatially upsampling the output of the BL using super-resolution, and a temporal inter-frame prediction signal, that is generated by decoder-side motion compensation without signaling any motion vectors. We show that our method outperforms LCEVC and has comparable performance to full-resolution VVC for high-resolution content, while still offering scalability.
In many digital systems, the transmission bandwidth as well as the storage capacity are usually very limited. This introduces challenges for both video transmission and video storage. To reach lower bit rates and still obtain high-quality upsampled videos, this paper proposes a temporal-downsampling-based video coding system and a frame-recurrent-enhancement-based video upsampling strategy. The structure of our proposed method is shown in Fig. 1. Unlike the existing work [1], instead of downsampling all video frames, only the intermediate frames are downsampled, and two frames are kept at high quality in the video coding system. These two high-quality frames are then used to iteratively enhance the quality of the low-bitrate, low-quality frames through a deep-learned enhancement network. Compared to the latest video coding standard, Versatile Video Coding (VVC), our work obtains a BD-rate reduction of 39.261% to 85.455% in the All-Intra and Low-Delay-P configurations on the downsampled frames. A temporal-downsampling-based video coding framework (TDS) is proposed; it can be combined with all existing coding standards, including HEVC/H.265 and VVC/H.266. A method of super-resolution with frame-recurrent image enhancement (SRFR) is applied to upsample the frames using the neighboring high-resolution frames. The temporal information from high-resolution frames can thus be fully used to improve the video quality through frame recurrence.
In this paper, we consider the task of space-time video super-resolution (ST-VSR), namely, expanding a given source video to a higher frame rate and resolution simultaneously. However, most existing schemes either consider a fixed intermediate time and scale in the training stage or only accept a preset number of input frames (e.g., two adjacent frames) that fails to exploit long-range temporal information. To address these problems, we propose a continuous ST-VSR (C-STVSR) method that can convert the given video to any frame rate and spatial resolution. To achieve time-arbitrary interpolation, we propose a forward warping guided frame synthesis module and an optical-flow-guided context consistency loss to better approximate extreme motion and preserve similar structures among input and prediction frames. In addition, we design a memory-friendly cascading depth-to-space module to realize continuous spatial upsampling. Meanwhile, with the sophisticated reorganization of optical flow, the proposed method is memory friendly, making it possible to propagate information from long-range neighboring frames and achieve better reconstruction quality. Extensive experiments show that the proposed algorithm has good flexibility and achieves better performance on various datasets compared with the state-of-the-art methods in both objective evaluations and subjective visual effects.
Downsampling is one of the most basic image processing operations. Improper spatio-temporal downsampling applied to videos can cause aliasing issues such as moiré patterns in space and the wagon-wheel effect in time. Consequently, the inverse task of upscaling a low-resolution, low-frame-rate video in space and time becomes a challenging ill-posed problem due to information loss and aliasing artifacts. In this paper, we aim to solve the space-time aliasing problem by learning a spatio-temporal downsampler. Towards this goal, we propose a neural network framework that jointly learns spatio-temporal downsampling and upsampling. It enables the downsampler to retain the key patterns of the original video and maximizes the reconstruction performance of the upsampler. To make the downsampling results compatible with popular image and video storage formats, they are encoded to uint8 with a differentiable quantization layer. To fully utilize the space-time correspondences, we propose two novel modules for explicit temporal propagation and space-time feature rearrangement. Experimental results show that our proposed method significantly boosts the space-time reconstruction quality by preserving spatial textures and motion patterns in both downsampling and upscaling. Moreover, our framework enables a variety of applications, including arbitrary video resampling, blurry frame reconstruction, and efficient video storage.
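As a rough illustration of how quantization can be made trainable, the sketch below implements a generic straight-through uint8 quantizer; this is a common construction and only an assumption about how such a layer might look, not the paper's actual module.

```python
# Minimal sketch of a differentiable uint8 quantization layer using a
# straight-through estimator, assuming inputs already lie in [0, 1].
import torch

class QuantizeUint8(torch.nn.Module):
    def forward(self, x):
        x = x.clamp(0.0, 1.0)
        q = torch.round(x * 255.0) / 255.0          # snap to the 256 8-bit levels
        # Straight-through: forward pass uses q, backward pass uses identity.
        return x + (q - x).detach()

# Usage
layer = QuantizeUint8()
frames = torch.rand(1, 3, 64, 64, requires_grad=True)
out = layer(frames)
out.mean().backward()          # gradients flow as if quantization were identity
```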
A video compression framework based on spatio-temporal resolution adaptation (ViSTRA) is proposed, which dynamically resamples the input video spatially and temporally during encoding, based on a quantisation-resolution decision, and reconstructs the full resolution video at the decoder. Temporal upsampling is performed using frame repetition, whereas a convolutional neural network super-resolution model is employed for spatial resolution upsampling. ViSTRA has been integrated into the high efficiency video coding reference software (HM 16.14). Experimental results verified via an international challenge show significant improvements, with BD-rate gains of 15% based on PSNR and an average MOS difference of 0.5 based on subjective visual quality tests.
In the domain of space-time video super-resolution, it is typically challenging to handle complex motions (including large and nonlinear motions) and varying illumination scenes due to the lack of inter-frame information. Leveraging the dense temporal information provided by event signals offers a promising solution. Traditional event-based methods typically rely on multiple images, using motion estimation and compensation, which can introduce errors. Accumulated errors from multiple frames often lead to artifacts and blurriness in the output. To mitigate these issues, we propose EvSTVSR, a method that uses fewer adjacent frames and integrates dense temporal information from events to guide alignment. Additionally, we introduce a coordinate-based feature fusion upsampling module to achieve spatial super-resolution. Experimental results demonstrate that our method not only outperforms existing RGB-based approaches but also excels in handling large motion scenarios.
Space-time video resampling aims to conduct both spatial-temporal downsampling and upsampling processes to achieve high-quality video reconstruction. Although there has been much progress, some major challenges still exist, such as how to preserve motion information during temporal resampling while avoiding blurring artifacts, and how to achieve flexible temporal and spatial resampling factors. In this paper, we introduce an Invertible Motion Steganography Module (IMSM), designed to embed motion information from high-frame-rate videos into downsampled frames with lower frame rates in a visually imperceptible manner. Its reversible nature allows the motion information to be recovered, facilitating the reconstruction of high-frame-rate videos. Furthermore, we propose a 3D implicit feature modulation technique that enables continuous spatiotemporal resampling. With tailored training strategies, our method supports flexible frame rate conversions, including non-integer changes like 30 FPS to 24 FPS and vice versa. Extensive experiments show that our method significantly outperforms existing solutions across multiple datasets in various video resampling tasks with high flexibility. Codes will be made available at the URL https://github.com/hahazh/CSTVR.
As internet video evolves towards higher quality, it poses challenges to the Quality of Experience (QoE) of adaptive streaming systems. To deliver high visual quality while avoiding rebuffering, we propose Gecko, an adaptive streaming system based on Prompt Inversion. At the media server, high-quality video is inverted to low-bitrate prompts. At the client, the received prompts are used to reconstruct high-fidelity video. To support videos with large-scale movements, a temporal-structural prompt is proposed to explicitly control temporal changes. To support high resolutions, an inverse upsampling algorithm is introduced, which integrates upsampling into the inversion. To further reduce bandwidth usage, a chunk-wise inverse prompt is proposed. We implement Gecko on Puffer, with fine-grained integration of both the browser client and the media server. Evaluations under real-world network traces demonstrate that Gecko can reduce bandwidth usage by 10x compared to H.264 and reduce rebuffering by 91.2% compared to DVC. Moreover, Gecko can generate 4K videos at 69 FPS with a single RTX 4090D GPU.
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we propose a strong baseline BasicAVSR for AVSR by integrating four key components: 1) adaptive multi-scale frequency priors generated from image Laplacian pyramids, 2) a flow-guided propagation unit to aggregate spatiotemporal information from adjacent frames, 3) a second-order motion compensation unit for more accurate spatial alignment of adjacent frames, and 4) a hyper-upsampling unit to generate scale-aware and content-independent upsampling kernels. To meet diverse application demands, we instantiate three propagation variants: (i) a unidirectional RNN unit for strictly online inference, (ii) a unidirectional RNN unit empowered with a limited lookahead that tolerates a small output delay, and (iii) a bidirectional RNN unit designed for offline tasks where computational resources are less constrained. Experimental results demonstrate the effectiveness and adaptability of our model across these different scenarios. Through extensive experiments, we show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed. Our work not only advances the state-of-the-art in AVSR but also extends its core components to multiple frameworks for diverse scenarios. The code is available at https://github.com/shangwei5/BasicAVSR.
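The flow-guided propagation unit mentioned above builds on backward warping of features with optical flow; the sketch below shows this generic warping operation (a common building block, written here with illustrative tensor shapes rather than the paper's code).

```python
# Minimal sketch of flow-guided (backward) warping with grid_sample, the basic
# operation behind flow-guided propagation; tensors and flow values are illustrative.
import torch
import torch.nn.functional as F

def flow_warp(feat, flow):
    """Warp a feature map (N, C, H, W) with optical flow (N, 2, H, W) given in pixels."""
    n, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat.device)      # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                # sampling positions
    # Normalize to [-1, 1] for grid_sample, which expects (N, H, W, 2) as (x, y).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)
    return F.grid_sample(feat, grid_norm, align_corners=True)

# Usage: align features from the previous frame to the current one.
prev_feat = torch.rand(1, 16, 32, 32)
flow = torch.zeros(1, 2, 32, 32)        # zero flow -> identity warp
cur_aligned = flow_warp(prev_feat, flow)
```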
No abstract available
In this paper, we present VideoGen, a text-to-video generation approach that can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt for generating latent video representations, followed by a flow-based temporal upsampling step to improve the temporal resolution. Finally, we map the latent video representations into a high-definition video through an enhanced video decoder. During training, we use the first frame of a ground-truth video as the reference image for training the cascaded latent diffusion module. The main characteristics of our approach are: the reference image generated by the text-to-image model improves the visual fidelity; using it as the condition makes the diffusion model focus more on learning the video dynamics; and the video decoder is trained on unlabeled video data, thus benefiting from high-quality, easily available videos. VideoGen sets a new state of the art in text-to-video generation in terms of both qualitative and quantitative evaluation. See https://videogen.github.io/VideoGen/ for more samples.
Arbitrary-scale video super-resolution (AVSR) aims to enhance the resolution of video frames, potentially at various scaling factors, which presents several challenges regarding spatial detail reproduction, temporal consistency, and computational complexity. In this paper, we first describe a strong baseline for AVSR by putting together three variants of elementary building blocks: 1) a flow-guided recurrent unit that aggregates spatiotemporal information from previous frames, 2) a flow-refined cross-attention unit that selects spatiotemporal information from future frames, and 3) a hyper-upsampling unit that generates scale-aware and content-independent upsampling kernels. We then introduce ST-AVSR by equipping our baseline with a multi-scale structural and textural prior computed from the pre-trained VGG network. This prior has proven effective in discriminating structure and texture across different locations and scales, which is beneficial for AVSR. Comprehensive experiments show that ST-AVSR significantly improves super-resolution quality, generalization ability, and inference speed over the state-of-the-art. The code is available at https://github.com/shangwei5/ST-AVSR.
Video rescaling helps to fit different display devices. In video rescaling systems, videos are downsampled for easier storage, transmission, and preview. The downsampled videos can be upsampled with a neural network to restore the details when needed. Previous group-based video rescaling algorithms benefit from the joint downsampling and joint upsampling of multiple frames, but are restricted by the fully joint operation. In this paper, we propose a recurrent diffusion-based framework for video rescaling. We employ biased joint operation and recurrent diffusion to make better use of the temporal relations among the frames in each image group. We explicitly control the direction of information propagation by arranging the processing order of all frames. In the biased joint operation, we concentrate on restoring one frame, i.e., the middle frame, while the other frames in the group are coarsely reconstructed. Our recurrent diffusion compensates the coarse frames by gradually propagating information from the middle frame to the border frames, backward and forward. The recurrent diffusion module is performed by fusing the information of adjacent frames. Biased joint operation and recurrent diffusion are jointly trained. We design several propagation variants and find that our recurrent diffusion is the best among them. It is also shown that recurrent diffusion is better than non-recurrent diffusion in terms of reconstruction quality and model size. We also adopt a high-resolution fine-tuning strategy to further improve the quality of high-resolution frames. Experimental results demonstrate the effectiveness of the proposed method in terms of visual quality, quantitative evaluations, and computational efficiency. The code will be released at https://github.com/5ofwind/RDVR.
We propose a Dynamic Context-Guided Upsampling (DCGU) module for video super-resolution (VSR) that leverages temporal context guidance to achieve efficient and effective arbitrary-scale VSR. While most VSR research focuses on backbone design, the importance of the upsampling part is often overlooked. Existing methods rely on pixelshuffle-based upsampling, which has limited capability to handle arbitrary upsampling scales. Recent attempts to replace pixelshuffle-based modules with implicit neural function-based and filter-based approaches suffer from slow inference speeds and limited representation capacity, respectively. To overcome these limitations, our DCGU module predicts non-local sampling locations and content-dependent filter weights, enabling efficient and effective arbitrary-scale VSR. Our proposed multi-granularity location search module efficiently identifies non-local sampling locations across the entire low-resolution grid, and the temporal bilateral filter modulation module integrates content information with the filter weights to enhance textural details. Extensive experiments demonstrate the superiority of our method in terms of performance and speed on arbitrary-scale VSR.
Space-time video super-resolution (ST-VSR) aims to simultaneously expand a given source video to a higher frame rate and resolution. However, most existing schemes either consider fixed intermediate time and scale or fail to exploit long-range temporal information due to model design or inefficient motion estimation and compensation. To address these problems, we propose a continuous ST-VSR method to convert the given video to any frame rate and spatial resolution with Multi-stage Motion information reorganization (MsMr). To achieve time-arbitrary interpolation, we propose a forward warping guided frame synthesis module and an optical flow-guided context consistency loss to better approximate extreme motion and preserve similar structures among input and prediction frames. To realize continuous spatial upsampling, we design a memory-friendly cascading depth-to-space module. Meanwhile, with the sophisticated reorganization of optical flow, MsMr realizes more efficient motion estimation and motion compensation, making it possible to propagate information from long-range neighboring frames and achieve better reconstruction quality. Extensive experiments show that the proposed algorithm is flexible and performs better on various datasets than the state-of-the-art methods. The code will be available at https://github.com/hahazh/LD-STVSR.
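Both this entry and the related C-STVSR entry above rely on a memory-friendly cascading depth-to-space module for spatial upsampling; the sketch below shows a generic cascaded pixel-shuffle upsampler of that kind, with illustrative channel counts and layer choices rather than the published architecture.

```python
# Minimal sketch of a cascaded depth-to-space (pixel-shuffle) upsampler:
# two x2 stages instead of one x4 stage; all sizes are illustrative.
import torch
import torch.nn as nn

class CascadedDepthToSpace(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(channels, channels * 4, 3, padding=1),
                                    nn.PixelShuffle(2), nn.LeakyReLU(0.1, inplace=True))
        self.stage2 = nn.Sequential(nn.Conv2d(channels, channels * 4, 3, padding=1),
                                    nn.PixelShuffle(2), nn.LeakyReLU(0.1, inplace=True))
        self.to_rgb = nn.Conv2d(channels, 3, 3, padding=1)

    def forward(self, feat):
        return self.to_rgb(self.stage2(self.stage1(feat)))

# Usage
up = CascadedDepthToSpace()
hr = up(torch.rand(1, 64, 32, 32))      # -> (1, 3, 128, 128)
```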
No abstract available
Diffusion models (DMs) have demonstrated exceptional success in video super-resolution (VSR), showcasing a powerful capacity for generating fine-grained details. However, their potential for space-time video super-resolution (STVSR), which necessitates not only recovering realistic visual content from low-resolution to high-resolution but also improving the frame rate with coherent temporal dynamics, remains largely underexplored. Moreover, existing STVSR methods predominantly address spatiotemporal upsampling under simplified degradation assumptions, which often struggle in real-world scenarios with complex unknown degradations. Such a high demand for reconstruction fidelity and temporal consistency makes the development of a robust STVSR framework particularly non-trivial. To address these challenges, we propose OSDEnhancer, a novel framework that, to the best of our knowledge, represents the first method to achieve real-world STVSR through an efficient one-step diffusion process. OSDEnhancer initializes essential spatiotemporal structures through a linear pre-interpolation strategy and pivots on training temporal refinement and spatial enhancement mixture of experts (TR-SE MoE), which allows distinct expert pathways to progressively learn robust, specialized representations for temporal coherence and spatial detail, further collaboratively reinforcing each other during inference. A bidirectional deformable variational autoencoder (VAE) decoder is further introduced to perform recurrent spatiotemporal aggregation and propagation, enhancing cross-frame reconstruction fidelity. Experiments demonstrate that the proposed method achieves state-of-the-art performance while maintaining superior generalization capability in real-world scenarios.
Video super-resolution often reconstructs high-resolution (HR) video from low-resolution (LR) video that has been downsampled using predefined methods, which is an ill-posed problem. Recent video rescaling algorithms alleviate this problem by jointly training the downsampling and upsampling processes. However, they primarily exploit shallow temporal correlations among video frames, overlooking the intricate, long-term sequential depth dependencies within the video. In this paper, we propose an omniscient feature alignment to leverage bidirectional deep temporal information for video rescaling, namely OFA-VRN. In the downsampling phase, the proposed method separates the input HR video into LR frames and high-frequency components using the Haar wavelet transform and explicitly embeds the high-frequency components into the LR frames. In this way, detail information is stored in the frames while the downsampled videos maintain visual perceptual quality. During the upsampling phase, we use an advanced bidirectional propagation paradigm to enhance temporal information aggregation capabilities. By incorporating the proposed omniscient feature alignment, the network is capable of leveraging multi-frame feature information from the triplet dimension to further alleviate misalignment issues, thereby enhancing its capacity for deep temporal information utilization. The experiments on Vid4 and Vimeo90K-T demonstrate that our model achieves competitive performance compared to the state-of-the-art methods.
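For reference, the single-level 2D Haar transform that underlies this kind of LR/high-frequency split can be written in a few lines; the sketch below is a generic, exactly invertible implementation and makes no claim about how OFA-VRN embeds the detail bands.

```python
# Minimal sketch of a single-level 2D Haar transform that splits a frame into a
# half-resolution approximation (the "LR frame") and three high-frequency bands.
import numpy as np

def haar2d(frame):
    """frame: (H, W) with even H and W. Returns (LL, LH, HL, HH)."""
    a = frame[0::2, 0::2]
    b = frame[0::2, 1::2]
    c = frame[1::2, 0::2]
    d = frame[1::2, 1::2]
    ll = (a + b + c + d) / 2.0      # approximation (downsampled content)
    lh = (a - b + c - d) / 2.0      # detail band
    hl = (a + b - c - d) / 2.0      # detail band
    hh = (a - b - c + d) / 2.0      # detail band
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Inverse transform, reconstructing the original frame exactly."""
    h, w = ll.shape
    out = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    out[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    out[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    out[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    out[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return out

x = np.random.rand(8, 8)
assert np.allclose(ihaar2d(*haar2d(x)), x)
```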
Video restoration task aims to recover high-quality videos from low-quality observations. This contains various important sub-tasks, such as video denoising, deblurring and low-light enhancement, since video often faces different types of degradation, such as blur, low light, and noise. Even worse, these kinds of degradation could happen simultaneously when taking videos in extreme environments. This poses significant challenges if one wants to remove these artifacts at the same time. In this paper, to the best of our knowledge, we are the first to propose an efficient end-to-end video transformer approach for the joint task of video deblurring, low-light enhancement, and denoising. This work builds a novel multi-tier transformer where each tier uses a different level of degraded video as a target to learn the features of video effectively. Moreover, we carefully design a new tier-to-tier feature fusion scheme to learn video features incrementally and accelerate the training process with a suitable adaptive weighting scheme. We also provide a new Multiscene-Lowlight-Blur-Noise (MLBN) dataset, which is generated according to the characteristics of the joint task based on the RealBlur dataset and YouTube videos to simulate realistic scenes as far as possible. We have conducted extensive experiments, compared with many previous state-of-the-art methods, to show the effectiveness of our approach clearly.
The exploitation of long-term information has been a long-standing problem in video restoration. The recent BasicVSR and BasicVSR++ have shown remarkable performance in video super-resolution through long-term propagation and effective alignment. Their success has led to a question of whether they can be transferred to different video restoration tasks. In this work, we extend BasicVSR++ to a generic framework for video restoration tasks. In tasks where inputs and outputs possess identical spatial size, the input resolution is reduced by strided convolutions to maintain efficiency. With only minimal changes from BasicVSR++, the proposed framework achieves compelling performance with great efficiency in various video restoration tasks including video deblurring and denoising. Notably, BasicVSR++ achieves comparable performance to Transformer-based approaches with up to 79% of parameter reduction and 44x speedup. The promising results demonstrate the importance of propagation and alignment in video restoration tasks beyond just video super-resolution. Code and models are available at https://github.com/ckkelvinchan/BasicVSR_PlusPlus.
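The efficiency trick described here (reducing the input resolution with strided convolutions when input and output share the same spatial size, then restoring it at the end) can be sketched generically as below; the channel counts, depth, and residual connection are illustrative assumptions, not the BasicVSR++ configuration.

```python
# Minimal sketch: strided convolutions shrink features for same-size restoration
# tasks, a small body processes them, and pixel shuffle restores the output size.
import torch
import torch.nn as nn

class SameSizeRestorer(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(3, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.1, True),
                                  nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.LeakyReLU(0.1, True))
        self.body = nn.Sequential(*[nn.Conv2d(ch, ch, 3, padding=1) for _ in range(4)])
        self.up = nn.Sequential(nn.Conv2d(ch, 3 * 16, 3, padding=1), nn.PixelShuffle(4))

    def forward(self, x):
        return x + self.up(self.body(self.down(x)))       # residual restoration

out = SameSizeRestorer()(torch.rand(1, 3, 64, 64))        # output keeps the 64x64 size
```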
Video restoration aims to reconstruct high-quality video sequences from low-quality inputs, addressing tasks such as super-resolution, denoising, and deblurring. Traditional regression-based methods often produce unrealistic details and require extensive paired datasets, while recent generative diffusion models face challenges in ensuring temporal consistency. We introduce DiTVR, a zero-shot video restoration framework that couples a diffusion transformer with trajectory-aware attention and a wavelet-guided, flow-consistent sampler. Unlike prior 3D convolutional or frame-wise diffusion approaches, our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. A spatiotemporal neighbour cache dynamically selects relevant tokens based on motion correspondences across frames. The flow-guided sampler injects data consistency only into low-frequency bands, preserving high-frequency priors while accelerating convergence. DiTVR establishes a new zero-shot state of the art on video restoration benchmarks, demonstrating superior temporal consistency and detail preservation while remaining robust to flow noise and occlusions.
Video distortion seriously affects user experience and downstream tasks. Existing video restoration methods still suffer from high-frequency detail loss, limited spatio-temporal dependency modeling, and high computational complexity. In this letter, we propose a novel video restoration method based on full-frequency spatio-temporal information enhancement (FFSTIE). The proposed FFSTIE includes an implicit alignment module for accurate recovery of high-frequency details and a full-frequency feature reconstruction module for adaptive enhancement of frequency components. Comprehensive experiments with quantitative and qualitative comparisons demonstrate the effectiveness of our FFSTIE method. On the video deblurring dataset DVD, FFSTIE achieves 0.75% improvement in PSNR and 1.08% improvement in SSIM with 35% fewer parameters and 59% lower GMAC compared to VDTR (TCSVT'2023), achieving a balance between performance and efficiency. On the video denoising dataset DAVIS, FFSTIE achieves the best performance with an average of 35.36 PSNR and 0.9347 SSIM, surpassing existing unsupervised methods.
Dynamic scene video deblurring is a challenging task due to the spatially variant blur inflicted by independently moving objects and camera shakes. Recent deep learning works bypass the ill-posedness of explicitly deriving the blur kernel by learning pixel-to-pixel mappings, which is commonly enhanced by larger region awareness. This is a difficult yet simplified scenario because noise is neglected when it is omnipresent in a wide spectrum of video processing applications. Despite its relevance, the problem of concurrent noise and dynamic blur has not yet been addressed in the deep learning literature. To this end, we analyze existing state-of-the-art deblurring methods and encounter their limitations in handling non-uniform blur under strong noise conditions. Thereafter, we propose a first-to-date work that addresses blur- and noise-free frame recovery by casting the restoration problem into a multi-task learning framework. Our contribution is threefold: a) We propose R2-D4, a multi-scale encoder architecture attached to two cascaded decoders performing the restoration task in two steps. b) We design multi-scale residual dense modules, bolstered by our modulated efficient channel attention, to enhance the encoder representations via augmenting deformable convolutions to capture longer-range and object-specific context that assists blur kernel estimation under strong noise. c) We perform extensive experiments and evaluate state-of-the-art approaches on a publicly available dataset under different noise levels. The proposed method performs favorably under all noise levels while retaining a reasonably low computational and memory footprint.
To address the problem that restoring blurred digital video easily loses inter-frame information and ignores spatiotemporal correlations, a video deblurring algorithm based on a denoising engine is proposed. We extend the adaptive Laplacian regularization term constructed by the denoising engine to the field of video restoration. First, self-similar redundant information in the video is captured through nonlocal means (NLM) regularization; we then present a new restoration model that mixes different regularizers, in particular combining the NLM regularizer with the denoising regularizer. To solve the video restoration model, we use the simple gradient descent method. The experimental results show that our method achieves a good deblurring effect and a certain robustness to noise.
Video processing is essential in entertainment, surveillance, and communication. This research presents a strong framework that improves video clarity and decreases bitrate via advanced restoration and compression methods. The suggested framework merges various deep learning models such as super-resolution, deblurring, denoising, and frame interpolation, in addition to a competent compression model. Video frames are first compressed using the libx265 codec in order to reduce bitrate and storage needs. After compression, restoration techniques deal with issues like noise, blur, and loss of detail. The video restoration transformer (VRT) uses deep learning to greatly enhance video quality by reducing compression artifacts. The frame resolution is improved by the super-resolution model, motion blur is fixed by the deblurring model, and noise is reduced by the denoising model, resulting in clearer frames. Frame interpolation creates additional frames between existing frames to create a smoother video viewing experience. Experimental findings show that this system successfully improves video quality and decreases artifacts, providing better perceptual quality and fidelity. The real-time processing capabilities of the technology make it well-suited for use in video streaming, surveillance, and digital cinema.
This paper presents a novel method for restoring digital videos via a Deep Plug-and-Play (PnP) approach. Under a Bayesian formalism, the method consists in using a deep convolutional denoising network in place of the proximal operator of the prior in an alternating optimization scheme. We distinguish ourselves from prior PnP work by directly applying that method to restore a digital video from a degraded video observation. This way, a network trained once for denoising can be repurposed for other video restoration tasks. Our experiments in video deblurring, super-resolution, and interpolation of random missing pixels all show a clear benefit to using a network specifically designed for video denoising, as it yields better restoration performance and better temporal stability than a single image network with similar denoising performance using the same PnP formulation. Moreover, our method compares favorably to applying a different state-of-the-art PnP scheme separately on each frame of the sequence. This opens new perspectives in the field of video restoration.
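The alternating scheme described here follows the standard plug-and-play pattern of interleaving a data-fidelity update with a denoising step; the sketch below illustrates that pattern on a toy deblurring problem, with a plain Gaussian filter standing in for the deep video denoiser and all parameters chosen for illustration only.

```python
# Minimal sketch of a plug-and-play iteration: a denoiser replaces the prior's
# proximal operator; the degradation is a known Gaussian blur, purely illustrative.
import numpy as np
from scipy.ndimage import gaussian_filter

def blur(x, sigma=2.0):
    return gaussian_filter(x, sigma)

def pnp_restore(y, n_iters=30, step=1.0, denoise_sigma=1.0):
    x = y.copy()
    for _ in range(n_iters):
        # Data-fidelity step: gradient descent on ||blur(x) - y||^2
        # (the adjoint of a symmetric Gaussian blur is the blur itself).
        x = x - step * blur(blur(x) - y)
        # Prior step: plug a denoiser in place of the proximal operator.
        x = gaussian_filter(x, denoise_sigma)
    return x

y = gaussian_filter(np.random.rand(64, 64), 2.0)     # toy degraded observation
restored = pnp_restore(y)
```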
Long-range temporal alignment is critical yet challenging for video restoration tasks. Recently, some works attempt to divide the long-range alignment into several sub-alignments and handle them progressively. Although this operation is helpful in modeling distant correspondences, error accumulation is inevitable due to the propagation mechanism. In this work, we present a novel, generic iterative alignment module which employs a gradual refinement scheme for sub-alignments, yielding more accurate motion compensation. To further enhance the alignment accuracy and temporal consistency, we develop a non-parametric re-weighting method, where the importance of each neighboring frame is adaptively evaluated in a spatial-wise way for aggregation. By virtue of the proposed strategies, our model achieves state-of-the-art performance on multiple benchmarks across a range of video restoration tasks including video super-resolution, denoising and deblurring.
No abstract available
No abstract available
Employing specific networks to address different types of degradation often proves complex and time-consuming in practical applications. Bracket Image Restoration and Enhancement (BIRE) aims to address various image restoration tasks in a unified manner by restoring clear single-frame images from multi-frame shots, including denoising, deblurring, high dynamic range (HDR) enhancement, and super-resolution under various degradation conditions. In this paper, we propose LGSTANet, an efficient aggregation restoration network for BIRE. Specifically, inspired by video restoration methods, we adopt an efficient architecture comprising alignment, aggregation, and reconstruction. Additionally, we introduce a Learnable Global Spatio-Temporal Adaptive (LGSTA) aggregation module to effectively aggregate inter-frame complementary information. Furthermore, we propose an adaptive restoration modulator to address specific degradation disturbances of various types, thereby achieving high-quality restoration outcomes. Extensive experiments demonstrate the effectiveness of our method. LGSTANet outperforms other state-of-the-art methods in Bracket Image Restoration and Enhancement and achieves competitive results in the NTIRE2024 BIRE challenge.
No abstract available
No abstract available
Pixel recovery with deep learning has proven very effective for a variety of low-level vision tasks such as image super-resolution, denoising, and deblurring. Most existing works operate in the spatial domain, and there are few works that exploit the transform domain for image restoration tasks. In this paper, we present a transform-domain approach for image deblocking using a deep neural network called DCTResNet. Our application is compressed-video motion deblur, where the input video frame has blocking artifacts that make the deblurring task very challenging. Specifically, we use a block-wise Discrete Cosine Transform (DCT) to decompose the image into its low- and high-frequency sub-band images and exploit the strong sub-band-specific features for more effective deblocking solutions. Since JPEG also uses the DCT for image compression, using DCT sub-band images for image deblocking helps the network learn the JPEG compression prior and thus correct the blocking artifacts effectively. Our experimental results show that DCTResNet performs more favorably than other state-of-the-art (SOTA) methods in both PSNR and SSIM, while being significantly faster at inference time.
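For illustration, a block-wise DCT sub-band split of the kind described can be sketched as follows; the 8x8 block size, the 4x4 low-frequency corner, and the exact partitioning are assumptions for the example, not details taken from DCTResNet.

```python
# Minimal sketch of block-wise 8x8 DCT analysis, splitting a frame into low- and
# high-frequency sub-band images; block size and cutoff are illustrative.
import numpy as np
from scipy.fft import dctn, idctn

def blockwise_dct_subbands(img, block=8, low_size=4):
    """Return (low_freq_img, high_freq_img) from per-block DCT coefficients."""
    h, w = img.shape
    low = np.zeros_like(img)
    high = np.zeros_like(img)
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = img[i:i + block, j:j + block]
            coeff = dctn(patch, norm="ortho")
            mask = np.zeros_like(coeff)
            mask[:low_size, :low_size] = 1.0            # keep the low-frequency corner
            low[i:i + block, j:j + block] = idctn(coeff * mask, norm="ortho")
            high[i:i + block, j:j + block] = idctn(coeff * (1 - mask), norm="ortho")
    return low, high

img = np.random.rand(64, 64)
low, high = blockwise_dct_subbands(img)
assert np.allclose(low + high, img)      # the split is exactly complementary
```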
Face Video Restoration (FVR) aims to reconstruct high-quality face videos from degraded input. Traditional methods struggle to preserve fine-grained, identity-specific features when degradation is severe, often producing average-looking faces that lack individual characteristics. To address these challenges, we introduce IP-FVR, a novel method that leverages a high-quality reference face image as a visual prompt to provide identity conditioning during the denoising process. IP-FVR incorporates semantically rich identity information from the reference image using decoupled cross-attention mechanisms, ensuring detailed and identity consistent results. For intra-clip identity drift (within 24 frames), we introduce an identity-preserving feedback learning method that combines cosine similarity-based reward signals with suffix-weighted temporal aggregation. This approach effectively minimizes drift within sequences of frames. For inter-clip identity drift, we develop an exponential blending strategy that aligns identities across clips by iteratively blending frames from previous clips during the denoising process. This method ensures consistent identity representation across different clips. Additionally, we enhance the restoration process with a multi-stream negative prompt, guiding the model's attention to relevant facial attributes and minimizing the generation of low-quality or incorrect features. Extensive experiments on both synthetic and real-world datasets demonstrate that IP-FVR outperforms existing methods in both quality and identity preservation, showcasing its substantial potential for practical applications in face video restoration. Our code and datasets are available at https://ip-fvr.github.io/.
In this paper, we propose the first diffusion-based all-in-one video restoration method that utilizes the power of a pre-trained Stable Diffusion and a fine-tuned ControlNet. Our method can restore various types of video degradation with a single unified model, overcoming the limitation of standard methods that require specific models for each restoration task. Our contributions include an efficient training strategy with Task Prompt Guidance (TPG) for diverse restoration tasks, an inference strategy that combines Denoising Diffusion Implicit Models (DDIM) inversion with a novel Sliding Window Cross-Frame Attention (SW-CFA) mechanism for enhanced content preservation and temporal consistency, and a scalable pipeline that makes our method all-in-one to adapt to different video restoration tasks. Through extensive experiments on five video restoration tasks, we demonstrate the superiority of our method in generalization capability to real-world videos and temporal consistency preservation over existing state-of-the-art methods. Our method advances the video restoration task by providing a unified solution that enhances video quality across multiple applications.
A spatio-temporal video filtering approach is proposed in this article. The video restoration technique introduced here successfully removes signal-independent noise, such as additive white Gaussian noise, that degrades the frame sequences. The proposed framework is based on a novel 3D fourth-order nonlinear reaction-diffusion model that reduces the additive white Gaussian noise (AWGN) considerably, overcomes side effects, and deals properly with the inter-frame correlation problem. A rigorous mathematical treatment is performed on this model and its validity is investigated. A numerical approximation algorithm that solves this nonlinear partial differential equation (PDE)-based model is then provided in this paper and applied successfully in the video denoising tests that are also described here.
No abstract available
High-resolution (HR) videos play a crucial role in many computer vision applications. Although existing video restoration (VR) methods can significantly enhance video quality by exploiting temporal information across video frames, they are typically trained for fixed upscaling factors and lack the flexibility to handle scales or degradations beyond their training distribution. In this paper, we introduce VR-INR, a novel video restoration approach based on Implicit Neural Representations (INRs) that is trained only on a single upscaling factor ($\times 4$) but generalizes effectively to arbitrary, unseen super-resolution scales at test time. Notably, VR-INR also performs zero-shot denoising on noisy input, despite never having seen noisy data during training. Our method employs a hierarchical spatial-temporal-texture encoding framework coupled with multi-resolution implicit hash encoding, enabling adaptive decoding of high-resolution and noise-suppressed frames from low-resolution inputs at any desired magnification. Experimental results show that VR-INR consistently maintains high-quality reconstructions at unseen scales and noise during training, significantly outperforming state-of-the-art approaches in sharpness, detail preservation, and denoising efficacy.
Diffusion models have emerged as powerful priors for single-image restoration, but their application to zero-shot video restoration suffers from temporal inconsistencies due to the stochastic nature of sampling and the complexity of incorporating explicit temporal modeling. In this work, we address the challenge of improving temporal coherence in video restoration using zero-shot image-based diffusion models without retraining or modifying their architecture. We propose two complementary inference-time strategies: (1) Perceptual Straightening Guidance (PSG), based on the neuroscience-inspired perceptual straightening hypothesis, which steers the diffusion denoising process towards smoother temporal evolution by incorporating a curvature penalty in a perceptual space to improve temporal perceptual scores, such as Fréchet Video Distance (FVD) and perceptual straightness; and (2) Multi-Path Ensemble Sampling (MPES), which aims at reducing stochastic variation by ensembling multiple diffusion trajectories to improve fidelity (distortion) scores, such as PSNR and SSIM, without sacrificing sharpness. Together, these training-free techniques provide a practical path toward temporally stable, high-fidelity perceptual video restoration using large pretrained diffusion models. We performed extensive experiments over multiple datasets and degradation types, systematically evaluating each strategy to understand their strengths and limitations. Our results show that while PSG enhances temporal naturalness, particularly in the case of temporal blur, MPES consistently improves fidelity and the spatio-temporal perception-distortion trade-off across all tasks.
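A curvature penalty of the kind PSG uses can be illustrated as the mean turning angle of a frame trajectory in some feature space; the sketch below uses simple flattening as the "perceptual" representation and is only a schematic stand-in for the paper's actual guidance term.

```python
# Minimal sketch of a trajectory-curvature measure: the mean angle between
# successive displacement vectors of per-frame features; lower means straighter.
import torch

def trajectory_curvature(frames_feat, eps=1e-8):
    """frames_feat: (T, D) sequence of per-frame feature vectors."""
    v = frames_feat[1:] - frames_feat[:-1]                    # displacements (T-1, D)
    v = v / (v.norm(dim=1, keepdim=True) + eps)
    cos = (v[1:] * v[:-1]).sum(dim=1).clamp(-1.0, 1.0)        # cosines between steps
    return torch.acos(cos).mean()                             # mean turning angle (radians)

# Usage: penalize curvature of a frame sequence during sampling.
frames = torch.rand(8, 3, 32, 32)
feats = frames.flatten(start_dim=1)      # placeholder for a perceptual embedding
penalty = trajectory_curvature(feats)
```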
How to effectively explore spatial and temporal information is important for video deblurring. In contrast to existing methods that directly align adjacent frames without discrimination, we develop a deep discriminative spatial and temporal network to facilitate the spatial and temporal feature exploration for better video deblurring. We first develop a channel-wise gated dynamic network to adaptively explore the spatial information. As adjacent frames usually contain different contents, directly stacking features of adjacent frames without discrimination may affect the latent clear frame restoration. Therefore, we develop a simple yet effective discriminative temporal feature fusion module to obtain useful temporal features for latent frame restoration. Moreover, to utilize the information from long-range frames, we develop a wavelet-based feature propagation method that takes the discriminative temporal feature fusion module as the basic unit to effectively propagate main structures from long-range frames for better video deblurring. Experimental results show that the proposed method performs favorably against state-of-the-art ones on benchmark datasets in terms of accuracy and model complexity.
In this paper, we introduce DiQP, a novel Transformer-Diffusion model for restoring 8K video quality degraded by codec compression. To the best of our knowledge, our model is the first to consider restoring the artifacts introduced by various codecs (AV1, HEVC) by Denoising Diffusion without considering additional noise. This approach allows us to model the complex, non-Gaussian nature of compression artifacts, effectively learning to reverse the degradation. Our architecture combines the power of Transformers to capture long-range dependencies with an enhanced windowed mechanism that preserves spatiotemporal context within groups of pixels across frames. To further enhance restoration, the model incorporates auxiliary “Look Ahead” and “Look Around” modules, providing both future and surrounding frame information to aid in reconstructing fine details and enhancing overall visual quality. Extensive experiments on different datasets demonstrate that our model outperforms state-of-the-art methods, particularly for high-resolution videos such as 4K and 8K, showcasing its effectiveness in restoring perceptually pleasing videos from highly compressed sources. Code: https://github.com/alimd94/DiQP
A novel spatio-temporal video denoising and restoration framework is introduced in this article. The proposed filtering technique deals effectively with mixtures composed of both signal-independent and signal-dependent noise components, which degrade the video sequences. It is based on a new well-posed 3D nonlinear second-order reaction-diffusion model that properly addresses the inter-frame correlation issue. A discretization algorithm that solves this nonlinear PDE-based model numerically is then proposed and used successfully in the video noise reduction experiments that are also described in this work.
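To make the PDE-based formulation in these two entries concrete, the sketch below performs an explicit finite-difference step of a generic 3D (frame, row, column) nonlinear diffusion with a data-fidelity reaction term; the edge-stopping function and all constants are illustrative and not taken from the cited models.

```python
# Minimal sketch of one explicit step of a generic 3D nonlinear diffusion
# u <- u + dt * (div(g(|grad u|) * grad u) - lam * (u - u0)) on a video volume.
import numpy as np

def diffusivity(grad_mag, k=0.05):
    """Perona-Malik-style edge-stopping function (illustrative choice)."""
    return 1.0 / (1.0 + (grad_mag / k) ** 2)

def diffusion_step(u, u0, dt=0.1, lam=0.05):
    """u: current estimate (T, H, W); u0: noisy observation of the same shape."""
    grads = np.gradient(u)                          # derivatives along t, y, x
    grad_mag = np.sqrt(sum(g ** 2 for g in grads))
    g = diffusivity(grad_mag)
    div = sum(np.gradient(g * gi, axis=ax) for ax, gi in enumerate(grads))
    return u + dt * (div - lam * (u - u0))          # reaction term pulls toward the data

noisy = np.random.rand(5, 32, 32)
u = noisy.copy()
for _ in range(20):
    u = diffusion_step(u, noisy)
```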
To reconstruct high dynamic range (HDR) video from alternating exposed low dynamic range (LDR) frames, the key is to address the misalignment and imprecise fusion caused by information loss and noise in ill-exposed regions. Following a coarse-to-fine manner, a Multi-level Spatial-Temporal feature aggregation and alignment-based Selective Residual Dense Propagation Network (MSTSRDPNet) is proposed. The Multi-level Spatial-Temporal aggregation extracts spatial-temporal features and aggregates them to mitigate information loss for fusion. The alignment-based Selective Residual Dense Propagation module reconstructs the aligned feature by using channel attention to redistribute feature weights while leveraging residual dense connections for information propagation. Experiments show that the proposed MSTSRDPNet outperforms all conventional methods on the synthetic dataset with PSNR-T, HDR-VQM, and HDR-VDP-2 scores of 44.64 dB, 86.83, and 73.9.
Video prediction is essential for recreating absent frames in video sequences while maintaining temporal and spatial coherence. This procedure, known as video inpainting, seeks to reconstruct missing segments by utilizing data from available frames. Frame interpolation, a fundamental component of this methodology, detects and produces intermediary frames between input sequences. The suggested methodology presents a Bidirectional Video Prediction Network (BVPN) for precisely forecasting absent frames that occur before, after, or between specified input frames. The BVPN framework incorporates temporal aggregation and recurrent propagation to improve forecast accuracy. Temporal aggregation employs a series of reference frames to generate absent content by harnessing existing spatial and temporal data, hence assuring seamless coherence. Recurrent propagation enhances temporal consistency by integrating pertinent information from prior time steps to progressively improve predictions. The timing of frames is constantly controlled through intermediate activations in the BVPN, allowing for accurate synchronization and improved temporal alignment. A fusion module integrates intermediate interpretations to generate cohesive final outputs. Experimental assessments indicate that the suggested method surpasses current state-of-the-art techniques in video inpainting and prediction, attaining enhanced smoothness and precision. Surveillance video datasets demonstrate substantial enhancements in predictive accuracy, highlighting the strength and efficacy of the suggested strategy in practical application. Highlights: (1) the proposed method integrates bidirectional video prediction, temporal aggregation, and recurrent propagation to effectively reconstruct missing intermediate video frames with enhanced accuracy; (2) comparative analysis using the UCF-Crime dataset demonstrates higher PSNR and SSIM values for the proposed method, indicating improved frame quality and temporal consistency over existing techniques; (3) this research provides a robust framework for future advancements in video frame prediction, contributing to applications in anomaly detection, surveillance, and video restoration.
High-resolution (HR) medical videos are vital for accurate diagnosis, yet are hard to acquire due to hardware limitations and physiological constraints. Clinically, the collected low-resolution (LR) medical videos present unique challenges for video super-resolution (VSR) models, including camera shake, noise, and abrupt frame transitions, which result in significant optical flow errors and alignment difficulties. Additionally, tissues and organs exhibit continuous and nuanced structures, but current VSR models are prone to introducing artifacts and distorted features that can mislead doctors. To this end, we propose MedVSR, a tailored framework for medical VSR. It first employs Cross State-Space Propagation (CSSP) to address the imprecise alignment by projecting distant frames as control matrices within state-space models, enabling the selective propagation of consistent and informative features to neighboring frames for effective alignment. Moreover, we design an Inner State-Space Reconstruction (ISSR) module that enhances tissue structures and reduces artifacts with joint long-range spatial feature learning and large-kernel short-range information aggregation. Experiments across four datasets in diverse medical scenarios, including endoscopy and cataract surgeries, show that MedVSR significantly outperforms existing VSR models in reconstruction performance and efficiency. Code released at https://github.com/CUHK-AIM-Group/MedVSR.
State Space Models (SSMs), most notably RNNs, have historically played a central role in sequential modeling. Although attention mechanisms such as Transformers have since dominated due to their ability to model global context, their quadratic complexity and limited scalability make them less suited for long sequences. Video super-resolution (VSR) methods have traditionally relied on recurrent architectures to propagate features across frames. However, such approaches suffer from well-known issues including vanishing gradients, lack of parallelism, and slow inference speed. Recent advances in selective SSMs like Mamba offer a compelling alternative: by enabling input-dependent state transitions with linear-time complexity, Mamba mitigates these issues while maintaining strong long-range modeling capabilities. Despite this potential, Mamba alone struggles to capture fine-grained spatial dependencies due to its causal nature and lack of explicit context aggregation. To address this, we propose a hybrid architecture that combines shifted window self-attention for spatial context aggregation with Mamba-based selective scanning for efficient temporal propagation. Furthermore, we introduce Gather-Scatter Mamba (GSM), an alignment-aware mechanism that warps features toward a center anchor frame within the temporal window before Mamba propagation and scatters them back afterward, effectively reducing occlusion artifacts and ensuring effective redistribution of aggregated information across all frames. The official implementation is provided at: https://github.com/Ko-Lani/GSMamba.
LiDAR representation learning aims to extract rich structural and semantic information from large-scale, readily available datasets, reducing reliance on costly human annotations. However, existing LiDAR representation strategies often overlook the inherent spatiotemporal cues in LiDAR sequences, limiting their effectiveness. In this work, we propose LiMA, a novel long-term image-to-LiDAR Memory Aggregation framework that explicitly captures longer-range temporal correlations to enhance LiDAR representation learning. LiMA comprises three key components: 1) a Cross-View Aggregation module that aligns and fuses overlapping regions across neighboring camera views, constructing a more unified and redundancy-free memory bank; 2) a Long-Term Feature Propagation mechanism that efficiently aligns and integrates multi-frame image features, reinforcing temporal coherence during LiDAR representation learning; and 3) a Cross-Sequence Memory Alignment strategy that enforces consistency across driving sequences, improving generalization to unseen environments. LiMA maintains high pretraining efficiency and incurs no additional computational overhead during downstream tasks. Extensive experiments on mainstream LiDAR-based perception benchmarks demonstrate that LiMA significantly improves both LiDAR semantic segmentation and 3D object detection. We hope this work inspires more effective pretraining paradigms for autonomous driving. The code has been made publicly accessible for future research.
Existing salient object detection (SOD) models are generally constrained by the limited receptive fields of convolutional neural networks (CNNs) and the quadratic computational complexity of Transformers. Recently, the emerging state-space model, namely Mamba, has shown great potential in balancing global receptive fields and computational efficiency. As a solution, we propose Saliency Mamba (Samba), a pure Mamba-based architecture that flexibly handles various distinct SOD tasks, including RGB/RGB-D/RGB-T SOD, video SOD (VSOD), RGB-D VSOD, and visible-depth-thermal SOD. Specifically, we rethink the scanning strategy of Mamba for SOD, and introduce a saliency-guided Mamba block (SGMB) that features a spatial neighborhood scanning (SNS) algorithm to preserve the spatial continuity of salient regions. A context-aware upsampling (CAU) method is also proposed to promote hierarchical feature alignment and aggregation by modeling contextual dependencies. As a step further, to avoid the "task-specific" problem of previous SOD solutions, we develop Samba+, which is empowered by training Samba in a multi-task joint manner, leading to a more unified and versatile model. Two crucial components that collaboratively tackle the challenges of arbitrary-modality input and continual adaptation are investigated. Specifically, a hub-and-spoke graph attention (HGA) module facilitates adaptive cross-modal interactive fusion, and a modality-anchored continual learning (MACL) strategy alleviates inter-modal conflicts together with catastrophic forgetting. Extensive experiments demonstrate that Samba individually outperforms existing methods across six SOD tasks on 22 datasets with lower computational cost, whereas Samba+ achieves even superior results on these tasks and datasets using a single trained versatile model. Additional results further demonstrate the potential of our Samba framework.
A major challenge of the video inpainting task is aggregating spatial and temporal information in the corrupted video effectively. In this paper, we propose a dynamic graph memory bank to address this challenge. To model the long-range temporal dependency, a memory bank is built and updated dynamically with the incoming visual information flow. The relationships among the memory items are modeled through graph-based message propagation. Benefiting from the dynamic graph memory bank, both the contents and their relationships in the corrupted video are well exploited as the inpainting process goes on. Besides, the spatial misalignment across different frames may degrade the quality of features in the dynamic graph memory bank. To alleviate this issue, we propose a motion-guided feature alignment module. The proposed module cooperates with the dynamic graph memory bank to improve the network’s information aggregation ability in the spatial and temporal dimensions. Extensive experiments on the YouTube-VOS and DAVIS datasets demonstrate the superiority of our approach when compared with the state-of-the-art methods.
Research on face video super-resolution has made significant strides at 2x and 4x magnification, but there is comparatively less work on higher magnification tasks. Leveraging the spatial processing capabilities of Convolutional Neural Networks (CNNs) and the long-range dependency modeling of Transformers, this paper presents the Face Video Super-resolution CNN Transformer (FVSR-CT), an effective CNN- and Transformer-based model designed for high-magnification face video super-resolution tasks. However, designing an appropriate CNN- and Transformer-based model for high-magnification face video super-resolution is challenging due to the lack of sufficient spatial information in the input frames, difficulties in inter-frame feature alignment, and the high computational costs associated with high-resolution spatial modeling. To address these challenges, FVSR-CT advocates using Multi-channel Spatial Encoding to extract and enhance information from the input frames, employing Inter-frame Point-wise Masked Attention to establish inter-frame alignment, and implementing a Low-rank Decomposition Reconstruction Method to optimize the parameters of the attention mechanism. Compared to existing advanced models, our proposed method achieves highly competitive results. Such a CNN- and Transformer-based model can serve as a baseline for video super-resolution and other video reconstruction tasks.
With the growing demand for high-definition video in applications such as surveillance, streaming, and virtual reality, video super-resolution (VSR) has become a key technology. Due to the temporal correlation between frames in video, the VSR task needs to focus on how to capture and exploit the inter-frame dependencies in its design in order to solve problems such as occlusion and temporal misalignment, which are important features that distinguish it from single-image super-resolution. In this paper, we propose a variable sliding window VSR method based on Swin Transformer to address temporal misalignment and varying motion patterns in video sequences. By dynamically adjusting attention window sizes based on motion intensity, the model effectively balances local detail enhancement and long-range dependency modeling. The framework includes motion estimation, adaptive patch alignment, and multi-frame self-attention fusion, optimized using the Charbonnier loss. Our method was evaluated on the REDS4 dataset and compared with other baseline VSR approaches using PSNR and SSIM. This adaptive design enhances both reconstruction quality and computational efficiency, offering a robust solution for high-fidelity video restoration.
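The Charbonnier loss referenced above is a standard robust reconstruction penalty; the motion-to-window-size mapping below is only an illustrative guess at how attention window sizes could be switched by motion intensity (the thresholds and helper names are assumptions, not the paper's settings):

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: a differentiable, outlier-robust variant of L1."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def pick_window_size(flow: torch.Tensor) -> int:
    """Choose a Swin-style attention window size from mean optical-flow magnitude.
    flow: (B, 2, H, W) displacement field in pixels. Thresholds are illustrative."""
    motion = flow.norm(dim=1).mean().item()   # average per-pixel displacement
    if motion < 1.0:
        return 8      # small motion: small windows, cheaper attention
    elif motion < 4.0:
        return 12
    return 16         # large motion: larger windows for longer-range context

loss = charbonnier_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
win = pick_window_size(torch.zeros(1, 2, 64, 64))   # -> 8
```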
Video Super-Resolution (VSR) is a key technology for upgrading the perceptual quality of low-resolution video streams, with direct relevance to surveillance, remote sensing, and environmental observation. We introduce GridFormer-VSR, a Vision Transformer (ViT) architecture that simultaneously restores fine spatial structures and models long-range inter-frame relations. The design fuses two complementary attention operators: a local, windowed self-attention for neighborhood reasoning and a grid-organized global attention that promotes scene-wide information exchange. In parallel, a lightweight HaloMBConv pathway lowers computational overhead, while preserving edge and texture fidelity. For assessment, we assemble a new benchmark by generating temporally aligned video clips from the AID remote-sensing corpus, and we propose an evaluation suite that jointly examines image fidelity and temporal stability. Across multiple benchmarks, GridFormer-VSR establishes state-of-the-art performance, yielding consistent gains in Peak Signal-to-Noise Ratio (PSNR), Structural Similarity (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS), while providing low-latency inference appropriate for real-time deployment. Owing to its scalable design, the model is well suited to operational use cases such as aerial surveillance and wide-area environmental monitoring. Collectively, these results position GridFormer-VSR as a robust and versatile solution for high-quality VSR.
We present STCDiT, a video super-resolution framework built upon a pre-trained video diffusion model, aiming to restore structurally faithful and temporally stable videos from degraded inputs, even under complex camera motions. The main challenges lie in maintaining temporal stability during reconstruction and preserving structural fidelity during generation. To address these challenges, we first develop a motion-aware VAE reconstruction method that performs segment-wise reconstruction, with each segment exhibiting uniform motion characteristics, thereby effectively handling videos with complex camera motions. Moreover, we observe that the first-frame latent extracted by the VAE encoder in each clip, termed the anchor-frame latent, remains unaffected by temporal compression and retains richer spatial structural information than subsequent frame latents. We further develop an anchor-frame guidance approach that leverages structural information from anchor frames to constrain the generation process and improve the structural fidelity of video features. Coupling these two designs enables the video diffusion model to achieve high-quality video super-resolution. Extensive experiments show that STCDiT outperforms state-of-the-art methods in terms of structural fidelity and temporal consistency.
With the increasing demand for high-definition video
No abstract available
No abstract available
No abstract available
Video super-resolution (VSR) is important in video processing for reconstructing high-definition image sequences from corresponding continuous and highly-related video frames. However, existing VSR methods have limitations in fusing spatial-temporal information. Some methods only fuse spatial-temporal information on a limited range of total input sequences, while others adopt a recurrent strategy that gradually attenuates the spatial information. While recent advances in VSR utilize Transformer-based methods to improve the quality of the upscaled videos, these methods require significant computational resources to model the long-range dependencies, which dramatically increases the model complexity. To address these issues, we propose a Collaborative Transformer for Video Super-Resolution (CTVSR). The proposed method integrates the strengths of Transformer-based and recurrent-based models by concurrently assimilating the spatial information derived from multi-scale receptive fields and the temporal information acquired from temporal trajectories. In particular, we propose a Spatial Enhanced Network (SEN) with two key components: Token Dropout Attention (TDA) and Deformable Multi-head Cross Attention (DMCA). TDA focuses on the key regions to extract more informative features, and DMCA employs deformable cross attention to gather information from adjacent frames. Moreover, we introduce a Temporal-trajectory Enhanced Network (TEN) that computes the similarity of a given token with temporally-related tokens in the temporal trajectory, which is different from previous methods that evaluate all tokens within the temporal dimension. With comprehensive quantitative and qualitative experiments on four widely-used VSR benchmarks, the proposed CTVSR achieves competitive performance with relatively low computational consumption and high forward speed.
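A hedged sketch of the token-dropout idea behind TDA: score tokens, keep only the most informative ones as keys and values, and attend from all queries to that reduced set. The scoring head, keep ratio, and overall structure are illustrative assumptions rather than the paper's exact module:

```python
import torch
import torch.nn as nn

class TokenDropoutAttention(nn.Module):
    """Illustrative sketch: rank tokens by a learned score, keep the top-k as
    keys/values, then attend from all query tokens to that reduced set."""

    def __init__(self, dim: int, num_heads: int = 4, keep_ratio: float = 0.5):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-token importance score
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) tokens of one frame
        B, N, C = x.shape
        k = max(1, int(N * self.keep_ratio))
        scores = self.score(x).squeeze(-1)              # (B, N)
        idx = scores.topk(k, dim=1).indices             # indices of kept tokens
        kept = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, C))  # (B, k, C)
        out, _ = self.attn(query=x, key=kept, value=kept)
        return out


y = TokenDropoutAttention(dim=64)(torch.randn(2, 196, 64))   # (2, 196, 64)
```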
No abstract available
Video Super-Resolution (VSR) aims to restore high-resolution (HR) videos from low-resolution (LR) videos. Existing VSR techniques usually recover HR frames by extracting pertinent textures from nearby frames with known degradation processes. Despite significant progress, grand challenges remain in effectively extracting and transmitting high-quality textures from highly degraded low-quality sequences affected by blur, additive noise, and compression artifacts. This work proposes a novel degradation-robust Frequency-Transformer (FTVSR++) for handling low-quality videos, which carries out self-attention in a combined space-time-frequency domain. First, video frames are split into patches and each patch is transformed into spectral maps in which each channel represents a frequency band. This permits a fine-grained self-attention on each frequency band so that real visual texture can be distinguished from artifacts. Second, a novel dual frequency attention (DFA) mechanism is proposed to capture the global and local frequency relations, which can handle different complicated degradation processes in real-world scenarios. Third, we explore different self-attention schemes for video processing in the frequency domain and discover that a “divided attention”, which conducts joint space-frequency attention before applying temporal-frequency attention, leads to the best video enhancement quality. Extensive experiments on three widely-used VSR datasets show that FTVSR++ outperforms state-of-the-art methods on different low-quality videos with clear visual margins.
Space-time video super-resolution (STVSR) is the task of interpolating videos with both low frame rate (LFR) and low resolution (LR) to produce high-frame-rate (HFR) and high-resolution (HR) counterparts. Existing methods based on convolutional neural networks (CNNs) achieve visually satisfying results but suffer from slow inference due to their heavy architectures. We propose to resolve this issue with a spatial-temporal transformer that naturally incorporates the spatial and temporal super-resolution modules into a single model. Unlike CNN-based methods, we do not explicitly use separate building blocks for temporal interpolation and spatial super-resolution; instead, we use a single end-to-end transformer architecture. Specifically, a reusable dictionary is built by encoders from the input LFR and LR frames, which is then utilized in the decoder to synthesize the HFR and HR frames. Compared with the state-of-the-art TMNet [54], our network is 60% smaller (4.5M vs 12.3M parameters) and 80% faster (26.2 fps vs 14.3 fps on 720 x 576 frames) without sacrificing much performance. The source code is available at https://github.com/llmpass/RSTT.
With stereo cameras becoming widely used in invasive surgery systems, stereo endoscopic images provide important depth information for delicate surgical tasks. However, the small size of sensors and their limited lighting conditions lead to low-quality and low-resolution endoscopic images and videos. In this paper, we propose a stereo endoscopic video super-resolution method using a transformer with a hybrid attention mechanism, named HA-VSR. Stereo video SR aims to reconstruct high-resolution (HR) images from corresponding low-resolution (LR) videos. In our method, the stereo correspondence and temporal correspondence are incorporated into the HA-VSR model. Specifically, the Swin transformer architecture is utilized in the proposed framework with hybrid attention mechanisms. The parallel attention mechanism exploits the symmetry and consistency of the left and right images, and the temporal attention mechanism exploits the consistency of consecutive frames. Detailed quantitative evaluations and experiments on two datasets show that the proposed model achieves advanced SR reconstruction performance and that the proposed stereo VSR framework outperforms alternative approaches.
Blind video super-resolution (BVSR) is a low-level vision task which aims to generate high-resolution videos from low-resolution counterparts in unknown degradation scenarios. Existing approaches typically predict blur kernels that are spatially invariant in each video frame or even the entire video. These methods do not consider potential spatio-temporal varying degradations in videos, resulting in suboptimal BVSR performance. In this context, we propose a novel BVSR model based on Implicit Kernels, BVSR-IK, which constructs a multi-scale kernel dictionary parameterized by implicit neural representations. It also employs a newly designed recurrent Transformer to predict the coefficient weights for accurate filtering in both frame correction and feature alignment. Experimental results have demonstrated the effectiveness of the proposed BVSR-IK, when compared with four state-of-the-art BVSR models on three commonly used datasets, with BVSR-IK outperforming the second best approach, FMA-Net, by up to 0.59 dB in PSNR. Source code will be available at https://github.com/QZ1-boy/BVSR-IK.
Compressed video super-resolution (VSR) aims to restore high-resolution frames from compressed low-resolution counterparts. Most recent VSR approaches often enhance an input frame by borrowing relevant textures from neighboring video frames. Although some progress has been made, there are grand challenges to effectively extract and transfer high-quality textures from compressed videos where most frames are usually highly degraded. In this paper, we propose a novel Frequency-Transformer for compressed video super-resolution (FTVSR) that conducts self-attention over a joint space-time-frequency domain. First, we divide a video frame into patches, and transform each patch into DCT spectral maps in which each channel represents a frequency band. Such a design enables a fine-grained level self-attention on each frequency band, so that real visual texture can be distinguished from artifacts, and further utilized for video frame restoration. Second, we study different self-attention schemes, and discover that a divided attention which conducts a joint space-frequency attention before applying temporal attention on each frequency band, leads to the best video enhancement quality. Experimental results on two widely-used video super-resolution benchmarks show that FTVSR outperforms state-of-the-art approaches on both uncompressed and compressed videos with clear visual margins. Code is available at https://github.com/researchmm/FTVSR.
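A rough sketch of the first step described above, turning image patches into DCT spectral maps in which each output channel holds one frequency band; the patch size and normalization are assumptions for illustration, not the paper's exact settings:

```python
import numpy as np
from scipy.fft import dctn

def patches_to_spectral_maps(frame: np.ndarray, patch: int = 8) -> np.ndarray:
    """frame: (H, W) grayscale image with H and W divisible by `patch`.
    Returns (H // patch, W // patch, patch * patch): each spatial location is
    one patch, and each channel holds one DCT frequency band of that patch."""
    H, W = frame.shape
    out = np.zeros((H // patch, W // patch, patch * patch), dtype=np.float32)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            coeffs = dctn(frame[i:i + patch, j:j + patch], norm="ortho")  # 2D DCT-II
            out[i // patch, j // patch] = coeffs.reshape(-1)              # bands -> channels
    return out

spec = patches_to_spectral_maps(np.random.rand(64, 64).astype(np.float32))
print(spec.shape)   # (8, 8, 64): 64 frequency-band channels per patch
```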
Video restoration remains an important task in multimedia processing because visual data captured in real environments often contain noise, motion artifacts, and resolution degradation. The demand for high-quality video has increased with the growth of surveillance systems, streaming platforms, and intelligent vision applications. Traditional denoising and super-resolution approaches rely on spatial filtering and convolutional neural networks, but these techniques struggle to model long-range temporal dependencies across frames. As a result, inconsistent textures, motion blur, and temporal flickering frequently appear in restored videos. The present study addresses these challenges by introducing the Recurrent Optical Flow Transformer (ROFT), a recurrent transformer architecture that integrates optical flow estimation with temporal attention for joint video denoising and super-resolution. The proposed framework uses a recurrent transformer module that captures temporal correlations between adjacent frames while maintaining spatial consistency. An optical flow estimation unit guides frame alignment, reducing motion distortion and misalignment during reconstruction. In addition, a temporal attention mechanism that analyzes contextual dependencies across multiple frames enhances feature representation for dynamic regions. The network processes sequential frames through recurrent connections that preserve temporal memory and improve reconstruction stability. Experiments conducted on benchmark video restoration datasets containing noisy and low-resolution sequences demonstrate that the proposed ROFT framework achieves superior performance compared with existing approaches. The model produces a PSNR of 35.8 dB and an SSIM of 0.97, indicating improved reconstruction quality and structural preservation. The reconstruction error decreases to 0.005 MSE, while the temporal consistency error reduces to 0.007, confirming stable frame transitions across video sequences. Furthermore, the model achieves an FSIM of 0.995, indicating strong preservation of perceptual texture features. These results demonstrate that the proposed architecture effectively integrates optical flow alignment and temporal transformer attention, enhancing both spatial detail recovery and temporal coherence in restored video frames.
Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem. Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. Thus, it seems to be straightforward to apply the vision Transformer to solve VSR. However, the typical block design of Transformer with a fully connected self-attention layer and a token-wise feed-forward layer does not fit well for VSR due to the following two reasons. First, the fully connected self-attention layer neglects to exploit the data locality because this layer relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment which is important for VSR since this layer independently processes each of the input token embeddings without any interaction among them. In this paper, we make the first attempt to adapt Transformer for VSR. Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer to discover the correlations across different video frames and also align features. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. The code will be available at https://github.com/caojiezhang/VSR-Transformer.
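Backward warping of neighbouring-frame features by optical flow is the standard operation underlying such a flow-based feed-forward layer; a self-contained sketch using torch.nn.functional.grid_sample, assuming the flow is given in pixels, might look like this:

```python
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp `feat` (B, C, H, W) by `flow` (B, 2, H, W), flow in pixels.
    flow[:, 0] is horizontal (x) displacement, flow[:, 1] vertical (y)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(feat)        # (2, H, W) base coords
    coords = grid.unsqueeze(0) + flow                            # displaced sampling positions
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid_norm = torch.stack((coords_x, coords_y), dim=-1)        # (B, H, W, 2)
    return F.grid_sample(feat, grid_norm, align_corners=True)

# Zero flow returns the features unchanged (up to interpolation).
aligned = flow_warp(torch.randn(1, 64, 32, 32), torch.zeros(1, 2, 32, 32))
```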
The U-Net architecture has exhibited significant efficacy across various vision tasks, yet its adaptation for Video Super-Resolution (VSR) remains underexplored. While the Video Restoration Transformer (VRT) introduced U-Net into the VSR domain, it poses challenges due to intricate design and substantial computational overhead. In this paper, we present VMG, a streamlined framework tailored for VSR. Through empirical analysis, we identify the crucial stages of the U-Net architecture contributing to performance enhancement in VSR tasks. Our optimized architecture substantially reduces model parameters and complexity while improving performance. Additionally, we introduce two key modules, namely the Gated MLP-like Mixer (GMM) and the Flow-Guided cross-attention Mixer (FGM), designed to enhance spatial and temporal feature aggregation. GMM dynamically encodes spatial correlations with linear complexity in space and time, and FGM leverages optical flow to capture motion variation and implement sparse attention to efficiently aggregate temporally related information. Extensive experiments demonstrate that VMG achieves nearly 70% reduction in GPU memory usage, 30% fewer parameters, and 10% lower computational complexity (FLOPs) compared to VRT, while yielding highly competitive or superior results across four benchmark datasets. Qualitative assessments reveal VMG’s ability to preserve remarkable details and sharp structures in the reconstructed videos. The code and pre-trained models are available at https://github.com/EasyVision-Ton/VMG.
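For intuition, a generic gated MLP-style mixer with a token-mixing projection that is linear in the number of tokens is sketched below; it illustrates the family of blocks the GMM belongs to, not VMG's exact design:

```python
import torch
import torch.nn as nn

class GatedMLPMixer(nn.Module):
    """gMLP-style spatial gating: one channel half is mixed across the token
    dimension (linear in the number of tokens) and gates the other half."""

    def __init__(self, dim: int, num_tokens: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.expand = nn.Linear(dim, dim * 2)
        self.spatial = nn.Linear(num_tokens, num_tokens)   # token-mixing projection
        self.project = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) flattened spatial tokens of one frame
        u, v = self.expand(self.norm(x)).chunk(2, dim=-1)    # two (B, N, C) halves
        v = self.spatial(v.transpose(1, 2)).transpose(1, 2)  # mix across tokens
        return self.project(u * v) + x                       # gate, project, residual


out = GatedMLPMixer(dim=64, num_tokens=256)(torch.randn(2, 256, 64))   # (2, 256, 64)
```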
This paper presents a general-purpose video super-resolution (VSR) method, dubbed VSR-HE, specifically designed to enhance the perceptual quality of compressed content. Targeting scenarios characterized by heavy compression, the method upscales low-resolution videos by a ratio of four, from 180p to 720p or from 270p to 1080p. VSR-HE adopts hierarchical encoding transformer blocks and has been sophisticatedly optimized to eliminate a wide range of compression artifacts commonly introduced by H.265/HEVC encoding across various quantization parameter (QP) levels. To ensure robustness and generalization, the model is trained and evaluated under diverse compression settings, allowing it to effectively restore fine-grained details and preserve visual fidelity. The proposed VSR-HE has been officially submitted to the ICME 2025 Grand Challenge on VSR for Video Conferencing (Team BVI-VSR), under both the Track 1 (General-Purpose Real-World Video Content) and Track 2 (Talking Head Videos).
Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary spatial and temporal scales. However, prevailing methods often generalize poorly, producing unsatisfactory results when applied to out-of-distribution (OOD) scales. To overcome this limitation, we present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams to achieve robust and generalizable C-STVSR. Our approach incorporates event-adapted synthesis that capitalizes on the spatiotemporal correlations between frames and events to capture long-term motion trajectories, enabling adaptive interpolation and fusion across space and time. This is then coupled with a local implicit video transformer that integrates local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations and generate plausible videos at arbitrary resolutions and frame rates. We further develop EvEnhancerPlus, which builds a controllable switching mechanism that dynamically determines the reconstruction difficulty for each spatiotemporal pixel based on local event statistics. This allows the model to adaptively route reconstruction along the most suitable pathways at a fine-grained pixel level, substantially reducing computational overhead while maintaining excellent performance. Furthermore, we devise a cross-derivative training strategy that stabilizes the convergence of such a multi-pathway framework through staged cross-optimization. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining superior generalizability at OOD scales. The code is available at https://github.com/W-Shuoyan/EvEnhancerPlus.
In this work, we rethink the approach to video super-resolution by introducing a method based on the Diffusion Posterior Sampling framework, combined with an unconditional video diffusion transformer operating in latent space. The video generation model, a diffusion transformer, functions as a spacetime model. We argue that a powerful model, which learns the physics of the real world, can easily handle various kinds of motion patterns as prior knowledge, thus eliminating the need for explicit estimation of optical flows or motion parameters for pixel alignment. Furthermore, a single instance of the proposed video diffusion transformer model can adapt to different sampling conditions without re-training. Empirical results on synthetic and real-world datasets illustrate the feasibility of diffusion-based, alignment-free video super-resolution.
Video super-resolution remains a major challenge in low-level vision tasks. To date, CNN- and Transformer-based methods have delivered impressive results. However, CNNs are limited by local receptive fields, while Transformers struggle with quadratic complexity, posing challenges for processing long sequences in VSR. Recently, Mamba has drawn attention for its long-sequence modeling, linear complexity, and large receptive fields. In this work, we propose VSRM, a novel Video Super-Resolution framework that leverages the power of Mamba. VSRM introduces Spatial-to-Temporal Mamba and Temporal-to-Spatial Mamba blocks to extract long-range spatio-temporal features and enhance receptive fields efficiently. To better align adjacent frames, we propose a Deformable Cross-Mamba Alignment module. This module utilizes a deformable cross-mamba mechanism to make the compensation stage more dynamic and flexible, preventing feature distortions. Finally, we minimize the frequency domain gaps between reconstructed and ground-truth frames by proposing a simple yet effective Frequency Charbonnier-like loss that better preserves high-frequency content and enhances visual quality. Through extensive experiments, VSRM achieves state-of-the-art results on diverse benchmarks, establishing itself as a solid foundation for future research.
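One plausible reading of a Frequency Charbonnier-like loss is a Charbonnier penalty applied to the difference between the FFT spectra of the reconstructed and ground-truth frames; the sketch below follows that reading and may differ from VSRM's exact formulation:

```python
import torch

def frequency_charbonnier_loss(pred: torch.Tensor, target: torch.Tensor,
                               eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier-style penalty on the 2D FFT of prediction and ground truth.
    pred/target: (B, C, H, W). The magnitude of the complex spectral difference
    is penalized with the usual sqrt(x^2 + eps^2) smoothing."""
    pf = torch.fft.rfft2(pred, norm="ortho")
    tf = torch.fft.rfft2(target, norm="ortho")
    diff = pf - tf
    return torch.sqrt(diff.real ** 2 + diff.imag ** 2 + eps ** 2).mean()

loss = frequency_charbonnier_loss(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64))
```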
The tradeoff between reconstruction quality and compute required for video super-resolution (VSR) remains a formidable challenge in its adoption for deployment on resource-constrained edge devices. While transformer-based VSR models have set new benchmarks for reconstruction quality in recent years, these require substantial computational resources. On the other hand, lightweight models that have been introduced even recently struggle to deliver state-of-the-art reconstruction. We propose a novel lightweight and parameter-efficient neural architecture for VSR that achieves state-of-the-art reconstruction accuracy with just 2.3 million parameters. Our model enhances information utilization based on several architectural attributes. Firstly, it uses 2D wavelet decompositions strategically interlayered with learnable convolutional layers to utilize the inductive prior of spatial sparsity of edges in visual data. Secondly, it uses a single memory tensor to capture inter-frame temporal information while avoiding the computational cost of previous memory-based schemes. Thirdly, it uses residual deformable convolutions for implicit inter-frame object alignment that improve upon deformable convolutions by enhancing spatial information in inter-frame feature differences. Architectural insights from our model can pave the way for real-time VSR on the edge, such as display devices for streaming data.
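A single-level 2D Haar decomposition implemented as a strided convolution, of the kind that could be interlayered with learnable convolutional layers, is sketched below; the specific wavelet and implementation are assumptions for illustration, not the paper's actual layers:

```python
import torch
import torch.nn.functional as F

def haar_dwt2d(x: torch.Tensor) -> torch.Tensor:
    """Single-level 2D Haar wavelet transform of x (B, C, H, W) with even H, W.
    Returns (B, 4*C, H/2, W/2): LL, LH, HL, HH sub-bands stacked along channels."""
    ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
    lh = torch.tensor([[0.5, 0.5], [-0.5, -0.5]])
    hl = torch.tensor([[0.5, -0.5], [0.5, -0.5]])
    hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
    kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1).to(x)      # (4, 1, 2, 2)
    B, C, H, W = x.shape
    # Apply the four analysis filters to every channel independently.
    out = F.conv2d(x.reshape(B * C, 1, H, W), kernels, stride=2)    # (B*C, 4, H/2, W/2)
    return out.reshape(B, C * 4, H // 2, W // 2)

bands = haar_dwt2d(torch.randn(1, 3, 64, 64))   # (1, 12, 32, 32)
```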
In this paper, we explore the space-time video super-resolution task, which aims to generate high frame rate (HFR) and high resolution (HR) videos from low frame rate (LFR) and low resolution (LR) videos. Most existing space-time video super-resolution methods simply combine the two sub-tasks of video frame interpolation (VFI) and video super-resolution (VSR). These methods usually use recursive propagation structures, but such structures are complex, very time-consuming, and do not make full use of feature information. To address these problems, we propose a single-stage space-time super-resolution architecture based on the Swin Transformer and second-order network propagation. The Swin Transformer allows a natural combination of the two sub-tasks into a single task, and the second-order network propagation enhances information propagation and efficiently utilizes the information of all input video frames. We also introduce a dataset pre-cleaning module, which not only alleviates image degradation before propagation, but also suppresses artifacts in the model output and improves the reconstruction performance of the proposed model. The experimental results show that, compared with the related two-stage network, our proposed model is lighter and its inference speed is faster with competitive performance.
Video super-resolution is the task of converting low-resolution video to high-resolution video. Existing methods with better visual quality are mainly based on convolutional neural networks (CNNs), but their architectures are heavy, resulting in slow inference. To address this problem, this paper proposes a Real-time Video Super-Resolution Transformer (RVSRT), which can quickly complete the super-resolution task while considering the visual fluency of video frame switching. Unlike traditional CNN-based methods, this paper does not process video frames separately with different network modules in the temporal domain, but batches adjacent frames through a single UNet-style, end-to-end Transformer network architecture. Moreover, this paper creatively sets up two-stage interpolation sampling before and after the end-to-end network to maximize the performance of the traditional CV algorithm. The experimental results show that, compared with the SOTA TMNet, RVSRT has only 50% of the network size (6.1M vs 12.3M parameters) while ensuring comparable performance, and the speed is increased by 80% (26.2 fps vs 14.3 fps at a frame size of 720×576).
Cloud Video Surveillance (CVS) systems, as the backbone of distributed surveillance networks, face increasing challenges in transmitting large volumes of high-resolution video data. While cloud-end collaborative video transmission methods can reduce bitrates beyond conventional compression techniques, the periodic transmission of high-resolution keyframes consumes significant bandwidth due to semantically irrelevant background information. To address this, we propose a cloud-edge-end collaborative video transmission scheme based on object-guided video super-resolution. In this scheme, the edge extracts key objects from sparsely selected keyframes at the end and transmits them along with low-resolution video to the cloud, where our Keyframe-Guided Video Restoration Transformer (KG-VRT) is used to improve the video quality. Experimental results on public datasets show that our network outperforms state-of-the-art keyframe-based baselines with a 1.73 dB PSNR improvement and maintains robust performance even with keyframe intervals of up to 30 frames. A comparative analysis of two transmission strategies—transmitting full keyframes versus transmitting only key object regions—demonstrates a 60% – 80% reduction in keyframe bitrate while maintaining object detection accuracy at a significantly reduced overall system bitrate. This highlights the efficiency and scalability of our approach in bandwidth-constrained surveillance scenarios.
Existing video super-resolution (SR) algorithms usually assume that the blur kernels in the degradation process are known and do not model the blur kernels in the restoration. However, this assumption does not hold for blind video SR and usually leads to over-smoothed super-resolved frames. In this paper, we propose an effective blind video SR algorithm based on deep convolutional neural networks (CNNs). Our algorithm first estimates blur kernels from low-resolution (LR) input videos. Then, with the estimated blur kernels, we develop an effective image deconvolution method based on the image formation model of blind video SR to generate intermediate latent frames so that sharp image contents can be restored well. To effectively explore the information from adjacent frames, we estimate the motion fields from LR input videos, extract features from LR videos by a feature extraction network, and warp the extracted features from LR inputs based on the motion fields. Moreover, we develop an effective sharp feature exploration method which first extracts sharp features from restored intermediate latent frames and then uses a transformation operation based on the extracted sharp features and warped features from LR inputs to generate better features for HR video restoration. We formulate the proposed algorithm into an end-to-end trainable framework and show that it performs favorably against state-of-the-art methods.
Existing deep learning-based video super-resolution (SR) methods usually depend on the supervised learning approach, where the training data is usually generated by a blurring operation with known or predefined kernels (e.g., the bicubic kernel) followed by a decimation operation. However, this does not hold for real applications, as the degradation process is complex and cannot be approximated by these ideal cases well. Moreover, obtaining high-resolution (HR) videos and the corresponding low-resolution (LR) ones in real-world scenarios is difficult. To overcome these problems, we propose a self-supervised learning method to solve the blind video SR problem, which simultaneously estimates blur kernels and HR videos from the LR videos. As directly using LR videos as supervision usually leads to trivial solutions, we develop a simple and effective method to generate auxiliary paired data from original LR videos according to the image formation of video SR, so that the networks can be better constrained by the generated paired data for both blur kernel estimation and latent HR video restoration. In addition, we introduce an optical flow estimation module to exploit the information from adjacent frames for HR video restoration. Experiments show that our method performs favorably against state-of-the-art ones on benchmarks and real-world videos.
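The generic idea of generating auxiliary paired data from the LR video itself is to re-apply the estimated degradation (blur followed by decimation) to each LR frame, so the original LR frame can supervise restoration of its further-degraded copy; the helper below is an illustrative sketch with an assumed kernel and scale, not the paper's exact procedure:

```python
import torch
import torch.nn.functional as F

def make_auxiliary_pair(lr_frame: torch.Tensor, kernel: torch.Tensor, scale: int = 4):
    """Re-apply an estimated degradation (blur + decimation) to an LR frame.
    lr_frame: (B, C, H, W); kernel: (k, k) estimated blur kernel.
    Returns (further-degraded input, original LR frame as supervision)."""
    k = kernel.shape[-1]
    C = lr_frame.shape[1]
    weight = kernel.repeat(C, 1, 1, 1).to(lr_frame)             # same kernel per channel
    blurred = F.conv2d(lr_frame, weight, padding=k // 2, groups=C)
    lr_of_lr = blurred[:, :, ::scale, ::scale]                  # decimation by `scale`
    return lr_of_lr, lr_frame

pair = make_auxiliary_pair(torch.rand(1, 3, 64, 64), torch.ones(5, 5) / 25.0)
```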
Video super-resolution (VSR) techniques, especially deep-learning-based algorithms, have drastically improved over the last few years and shown impressive performance on synthetic data. However, their performance on real-world video data suffers because of the complexity of real-world degradations and misaligned video frames. Since obtaining a synthetic dataset consisting of low-resolution (LR) and high-resolution (HR) frames is easier than obtaining real-world LR and HR images, in this paper, we propose synthesizing real-world degradations on synthetic training datasets. The proposed synthetic real-world degradations (SRWD) include a combination of blur, noise, down-sampling, pixel binning, and image and video compression artifacts. We then propose using a random shuffling-based strategy to simulate these degradations on the training datasets and train a single end-to-end deep neural network (DNN) on the proposed larger variation of realistic synthesized training data. Our quantitative and qualitative comparative analysis shows that the proposed training strategy using diverse realistic degradations improves the performance by 7.1% in terms of NRQM compared to RealBasicVSR and by 3.34% compared to BSRGAN on the VideoLQ dataset. We also introduce a new dataset that contains high-resolution real-world videos that can serve as a common ground for benchmarking.
No abstract available
Most conventional supervised super-resolution (SR) algorithms assume that low-resolution (LR) data is obtained by downscaling high-resolution (HR) data with a fixed known kernel, but such an assumption often does not hold in real scenarios. Some recent blind SR algorithms have been proposed to estimate different downscaling kernels for each input LR image. However, they suffer from heavy computational overhead, making them infeasible for direct application to videos. In this work, we present DynaVSR, a novel meta-learning-based framework for real-world video SR that enables efficient downscaling model estimation and adaptation to the current input. Specifically, we train a multi-frame downscaling module with various types of synthetic blur kernels, which is seamlessly combined with a video SR network for input-aware adaptation. Experimental results show that DynaVSR consistently improves the performance of the state-of-the-art video SR models by a large margin, with an order of magnitude faster inference time compared to the existing blind SR approaches.
Deep learning-based blind super-resolution (SR) methods have recently achieved unprecedented performance in upscaling frames with unknown degradation. These models are able to accurately estimate the unknown downscaling kernel from a given low-resolution (LR) image in order to leverage the kernel during restoration. Although these approaches have largely been successful, they are predominantly image-based and therefore do not exploit the temporal properties of the kernels across multiple video frames. In this paper, we investigated the temporal properties of the kernels and highlighted their importance in the task of blind video super-resolution. Specifically, we measured the kernel temporal consistency of real-world videos and illustrated how the estimated kernels might change per frame in videos of varying dynamicity of the scene and its objects. With this new insight, we revisited previous popular video SR approaches, and showed that previous assumptions of using a fixed kernel throughout the restoration process can lead to visual artifacts when upscaling real-world videos. In order to counteract this, we tailored existing single-image and video SR techniques to leverage kernel consistency during both kernel estimation and video upscaling processes. Extensive experiments on synthetic and real-world videos show substantial restoration gains quantitatively and qualitatively, achieving the new state-of-the-art in blind video SR and underlining the potential of exploiting kernel temporal consistency.
Recent efforts have witnessed remarkable progress in satellite video super-resolution (SVSR). However, most SVSR methods usually assume the degradation is fixed and known, e.g., bicubic downsampling, which makes them vulnerable in real-world scenes with multiple and unknown degradations. To alleviate this issue, blind SR has, thus, become a research hotspot. Nevertheless, the existing approaches are mainly engaged in blur kernel estimation while losing sight of another critical aspect for VSR tasks: temporal compensation, especially compensating for blurry and smooth pixels with vital sharpness from severely degraded satellite videos. Therefore, this article proposes a practical blind SVSR algorithm (BSVSR) to explore more sharp cues by considering the pixelwise blur levels in a coarse-to-fine manner. Specifically, we employed multiscale deformable (MSD) convolution to coarsely aggregate the temporal redundancy into adjacent frames by window-slid progressive fusion. Then, the adjacent features are finely merged into mid-feature using deformable attention (DA), which measures the blur levels of pixels and assigns more weights to the informative pixels, thus inspiring the representation of sharpness. Moreover, we devise a pyramid spatial transformation (PST) module to adjust the solution space of sharp mid-feature, resulting in flexible feature adaptation in multilevel domains. Quantitative and qualitative evaluations on both simulated and real-world satellite videos demonstrate that our BSVSR performs favorably against state-of-the-art nonblind and blind SR models. Code will be available at https://github.com/XY-boy/Blind-Satellite-VSR.
Super-resolution (SR) of satellite video has long been a critical research direction in the field of remote sensing video processing and analysis, and blind SR has attracted increasing attention in the face of satellite video with unknown degradation. However, existing blind SR methods mainly focus on accurate blur kernel estimation, while ignoring the importance of interframe information compensation in the time domain. Therefore, this article focuses on precise temporal information compensation and proposes a blind SR network based on interframe information compensation. First, we propose a multiscale parallel convolution block to alleviate the difficulty of alignment between satellite video frames due to the presence of moving objects of different scales. Second, we propose a hybrid attention-based feature extraction module that effectively extracts both local and global information between video frames. While activating more pixels, more attention is allocated to informative pixels to obtain the clean features. Finally, a pyramid space activation module is proposed to gradually adjust the clean features through a multilayer iterative pyramid structure, enabling the clean features to better perceive blur and achieve pixel-level fine compensation for unknown degraded frames. Extensive experiments on real satellite video datasets demonstrate that our method is superior to state-of-the-art non-blind and blind SR methods, both qualitatively and quantitatively.
Satellite video images contain temporal contextual information that is unavailable in single-frame images. Therefore, using a sequence of frames for super-resolution can significantly enhance the reconstruction effect. However, most existing satellite Video Super-Resolution (VSR) methods focus on improving the network's representation ability, overlooking the complex degradation processes present in real-world satellite videos, which pose a blind SR problem. In this paper, we propose an effective satellite VSR method based on a unidirectional recurrent network named URD-VSR. Simultaneously, a network independent of the SR structure is utilized to model the degradation process. Experiments on real satellite video datasets and integration with object detection demonstrate the effectiveness of the proposed method.
Video super-resolution (VSR) on mobile devices aims to restore high-resolution frames from their low-resolution counterparts, satisfying the requirements of performance, FLOPs and latency. On the one hand, partial feature processing, as a classic and acknowledged strategy, is developed in current studies to reach an appropriate trade-off between FLOPs and accuracy. However, the splitting in the partial feature processing strategy is usually performed in a blind manner, thereby reducing the computational efficiency and performance gains. On the other hand, current methods for mobile platforms primarily treat VSR as an extension of single-image super-resolution to reduce model calculation and inference latency. However, the lack of inter-frame information interaction in current methods results in a suboptimal latency and accuracy trade-off. To this end, we propose a novel architecture, termed Feature Aggregating Network with Inter-frame Interaction (FANI), a lightweight VSR network that nonetheless considers frame-wise correlation and achieves real-time inference while maintaining superior performance. Our FANI accepts adjacent multi-frame low-resolution images as input and generally consists of several fully-connection-embedded modules, i.e., Multi-stage Partial Feature Distillation (MPFD) modules, for capturing multi-level feature representations. Moreover, considering the importance of inter-frame alignment, we further employ a tiny Attention-based Frame Alignment (AFA) module to promote inter-frame information flow and aggregation efficiently. Extensive experiments on a well-known dataset and a real-world mobile device demonstrate the superiority of our proposed FANI, which means that our FANI could be well adapted to mobile devices and produce visually pleasing results.
No abstract available
Ultrasound imaging is widely applied in clinical practice, yet ultrasound videos often suffer from low signal-to-noise ratios (SNR) and limited resolutions, posing challenges for diagnosis and analysis. Variations in equipment and acquisition settings can further exacerbate differences in data distribution and noise levels, reducing the generalizability of pre-trained models. This work presents a self-supervised ultrasound video super-resolution algorithm called Deep Ultrasound Prior (DUP). DUP employs a video-adaptive optimization process of a neural network that enhances the resolution of given ultrasound videos without requiring paired training data while simultaneously removing noise. Quantitative and visual evaluations demonstrate that DUP outperforms existing super-resolution algorithms, leading to substantial improvements for downstream applications.
Most of the existing video face super-resolution (VFSR) methods are trained and evaluated on VoxCeleb1, which is designed specifically for speaker identification, and the frames in this dataset are of low quality. As a consequence, the VFSR models trained on this dataset cannot output visually pleasing results. In this paper, we develop an automatic and scalable pipeline to collect a high-quality video face dataset (VFHQ), which contains over 16,000 high-fidelity clips of diverse interview scenarios. To verify the necessity of VFHQ, we further conduct experiments and demonstrate that VFSR models trained on our VFHQ dataset can generate results with sharper edges and finer textures than those trained on VoxCeleb1. In addition, we show that the temporal information plays a pivotal role in eliminating video consistency issues as well as further improving visual performance. Based on VFHQ, we further conduct a benchmarking study of several state-of-the-art algorithms under bicubic and blind settings.
When unknown degradation is mixed with unknown blur kernels, how to perform super-resolution is an open issue. The main idea of existing zero-shot and non-zero-shot methods is to estimate the blur kernel, and the effectiveness of these methods depends on the accuracy of the deduced kernel. In this paper, we propose the Randomly initialized Zero-Shot Super-Resolution (RZSR) training strategy. RZSR is a zero-shot training method that allows the network to extract low-resolution image features and generate the corresponding high-resolution images under the interference of degradation algorithms. We further propose two model-agnostic modules, an Adaptive Information Extraction Module (AIEM) and a knowledge dictionary. They respectively assist the network to extract features and to fit the data distribution of clear images well. RZSR can be applied to any single image super-resolution and video super-resolution model. We prove the generalization ability and superiority of RZSR through a series of experiments.
Super-resolution image reconstruction under unknown Gaussian blur has been a challenging topic. Advanced optimization-based works for blind image super-resolution (SR) have been reported to be effective, but they incur both large data storage and long computation times due to vector-variable optimization. This paper proposes a matrix-variable optimization method for fast blind image SR. We first present an accurate blur kernel estimation-based matrix decomposition method. Then we propose minimizing a matrix-variable optimization problem with sparse representation and TV regularization terms. The proposed method can exactly estimate the unknown blur kernel and blur matrix. Compared with vector-variable optimization-based methods for blind image SR, the proposed method can greatly reduce their data storage and computation time. Compared with deep learning methods, the proposed method can directly deal with the multi-frame SR problem without any training or learning task. Experimental results show that the proposed algorithm is superior to the conventional optimization-based method in terms of solution quality and computation time. Moreover, the proposed method can obtain higher reconstruction quality than the deep learning methods, especially in the case of large blur kernels.
No abstract available
No abstract available
Recently, there have been significant advances in video super-resolution (VSR) techniques under blind and practical degradation settings. These techniques restore the fine details of each video frame while maintaining the temporal consistency between frames for smooth motion. Unfortunately, many attempts still fall short in the case of real-world videos. When diverse and complex in-the-wild degradation is introduced, the task becomes non-trivial and challenging. As a result, VSR techniques perform poorly in general. We argue that there is more room to improve the performance of VSR methods, as current methods are only trained on image-level degradation settings, leading to a restoration quality that may be sub-optimal for real-world degradation that varies pixel-wise within an image. To this end, we propose RealPixVSR, which leverages pixel-level representations to improve the pixel-level sensitivity to degradation. The pixel-level content-invariant degradation representation is learned in a self-supervised manner using a contrastive learning network referred to as the Pixel-Degradation-Representation-Network (PDRN). The learned visual representation is then merged with the cleaning and restoration networks using the Pixel-Degradation-Informed-Block (PDIB). Through experiments, we show that our network outperforms the latest state-of-the-art VSR models for real-world video.
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
Super-Resolution (SR) is a fundamental computer vision task that aims to obtain a high-resolution clean image from the given low-resolution counterpart. This paper reviews the NTIRE 2021 Challenge on Video Super-Resolution. We present evaluation results from two competition tracks as well as the proposed solutions. Track 1 aims to develop conventional video SR methods focusing on restoration quality. Track 2 assumes a more challenging environment with lower frame rates, casting a spatio-temporal SR problem. In the two tracks, 247 and 223 participants registered, respectively. During the final testing phase, 14 teams competed in each track to achieve state-of-the-art performance on video SR tasks.
In this paper, we address the space-time video super-resolution, which aims at generating a high-resolution (HR) slow-motion video from a low-resolution (LR) and low frame rate (LFR) video sequence. A naïve method is to decompose it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR). Nevertheless, temporal interpolation and spatial upscaling are intra-related in this problem. Two-stage approaches cannot fully make use of this natural property. Besides, state-of-the-art VFI or VSR deep networks usually have a large frame reconstruction module in order to obtain high-quality photo-realistic video frames, which makes the two-stage approaches have large models and thus be relatively time-consuming. To overcome the issues, we present a one-stage space-time video super-resolution framework, which can directly reconstruct an HR slow-motion video sequence from an input LR and LFR video. Instead of reconstructing missing LR intermediate frames as VFI models do, we temporally interpolate LR frame features of the missing LR frames capturing local temporal contexts by a feature temporal interpolation module. Extensive experiments on widely used benchmarks demonstrate that the proposed framework not only achieves better qualitative and quantitative performance on both clean and noisy LR frames but also is several times faster than recent state-of-the-art two-stage networks. The source code is released in https://github.com/Mukosame/Zooming-Slow-Mo-CVPR-2020 .
The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying the VSR model to real-world video with unknown and complex degradation remains a challenging task. First, existing degradation metrics in most VSR methods are not able to effectively simulate real-world noise and blur. On the contrary, simple combinations of classical degradation are used for real-world noise modeling, which led to the VSR model often being violated by out-of-distribution noise. Second, many SR models focus on noise simulation and transfer. Nevertheless, the sampled noise is monotonous and limited. To address the aforementioned problems, we propose a Negatives augmentation strategy for generalized noise modeling in Video Super-Resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degeneration domain is widely expanded by negative augmentation to build up various yet challenging real-world noise sets. We further propose the augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality. Project page is available at: https://negvsr.github.io/.
Video restoration (e.g., video super-resolution) aims to restore high-quality frames from low-quality frames. Different from single image restoration, video restoration generally requires to utilize temporal information from multiple adjacent but usually misaligned video frames. Existing deep methods generally tackle this by exploiting a sliding window strategy or a recurrent architecture, which either is restricted by frame-by-frame restoration or lacks long-range modelling ability. In this paper, we propose a Video Restoration Transformer (VRT) with parallel frame prediction and long-range temporal dependency modelling abilities. More specifically, VRT is composed of multiple scales, each of which consists of two kinds of modules: temporal mutual self attention (TMSA) and parallel warping. TMSA divides the video into small clips, on which mutual attention is applied for joint motion estimation, feature alignment and feature fusion, while self attention is used for feature extraction. To enable cross-clip interactions, the video sequence is shifted for every other layer. Besides, parallel warping is used to further fuse information from neighboring frames by parallel feature warping. Experimental results on five tasks, including video super-resolution, video deblurring, video denoising, video frame interpolation and space-time video super-resolution, demonstrate that VRT outperforms the state-of-the-art methods by large margins (up to 2.16 dB) on fourteen benchmark datasets.
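The clip partitioning with a shift every other layer resembles shifted-window attention applied along the temporal axis; a minimal sketch of that partitioning (the clip length and the use of torch.roll are assumptions about the mechanism, not VRT's exact implementation) is given below:

```python
import torch

def partition_clips(feats: torch.Tensor, clip_len: int = 2, shift: bool = False) -> torch.Tensor:
    """feats: (B, T, C) per-frame feature tokens with T divisible by clip_len.
    Optionally roll the sequence by half a clip so that, in alternating layers,
    attention windows straddle the previous clip boundaries."""
    if shift:
        feats = torch.roll(feats, shifts=-clip_len // 2, dims=1)
    B, T, C = feats.shape
    return feats.reshape(B, T // clip_len, clip_len, C)   # (B, num_clips, clip_len, C)

clips = partition_clips(torch.randn(1, 8, 64), clip_len=2, shift=True)   # (1, 4, 2, 64)
```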
Video super-resolution (VSR) is the task of restoring high-resolution frames from a sequence of low-resolution inputs. Different from single image super-resolution, VSR can utilize frames' temporal information to reconstruct results with more details. Recently, with the rapid development of convolutional neural networks (CNNs), the VSR task has drawn increasing attention and many CNN-based methods have achieved remarkable results. However, only a few VSR approaches can be applied to real-world mobile devices due to computational resource and runtime limitations. In this paper, we propose a Sliding Window based Recurrent Network (SWRN), which supports real-time inference while still achieving superior performance. Specifically, we notice that video frames have both spatial and temporal relations that can help to recover details, and the key point is how to extract and aggregate this information. To address this, we input three neighboring frames and utilize a hidden state to recurrently store and update the important temporal information. Our experiment on the REDS dataset shows that the proposed method can be well adapted to mobile devices and produce visually pleasing results.
This paper reviews the NTIRE2021 challenge on burst super-resolution. Given a RAW noisy burst as input, the task in the challenge was to generate a clean RGB image with 4 times higher resolution. The challenge contained two tracks; Track 1 evaluating on synthetically generated data, and Track 2 using real-world bursts from mobile camera. In the final testing phase, 6 teams submitted results using a diverse set of solutions. The top-performing methods set a new state-of-the-art for the burst super-resolution task.
This paper explores an efficient solution for space-time super-resolution, aiming to generate high-resolution slow-motion videos from low-resolution and low-frame-rate videos. A simplistic solution is the sequential running of video super-resolution and video frame interpolation models. However, this type of solution is memory-inefficient, has a high inference time, and cannot properly exploit the space-time relation. To this end, we first interpolate in LR space using quadratic modeling. Input LR frames are super-resolved using a state-of-the-art video super-resolution method. The flow maps and blending mask used to synthesize the LR interpolated frame are reused in HR space using bilinear upsampling. This leads to a coarse estimate of the HR intermediate frame, which often contains artifacts along motion boundaries. We use a refinement network to improve the quality of the HR intermediate frame via residual learning. Our model is lightweight and performs better than current state-of-the-art models on the REDS STSR validation set.
As a fundamental challenge in visual computing, video super-resolution (VSR) focuses on reconstructing high-definition video sequences from their degraded low-resolution counterparts. While deep convolutional neural networks have demonstrated state-of-the-art performance in spatial-temporal super-resolution tasks, their computationally intensive nature poses significant deployment challenges for resource-constrained edge devices, particularly in real-time mobile video processing scenarios where power efficiency and latency constraints coexist. In this work, we propose a Reparameterizable Architecture for High Fidelity Video Super Resolution method, named RepNet-VSR, for real-time 4x video super-resolution. On the REDS validation set, the proposed model achieves 27.79 dB PSNR when processing 180p to 720p frames in 103 ms per 10 frames on a MediaTek Dimensity NPU. The competition results demonstrate an excellent balance between restoration quality and deployment efficiency. The proposed method scores higher than the previous champion algorithm of the MAI video super-resolution challenge.
In this paper, we consider the task of space-time video super-resolution (ST-VSR), which can increase the spatial resolution and frame rate for a given video simultaneously. Despite the remarkable progress of recent methods, most of them still suffer from high computational costs and inefficient long-range information usage. To alleviate these problems, we propose a Bidirectional Recurrence Network (BRN) with an optical-flow-reuse strategy to better use temporal knowledge from long-range neighboring frames for high-efficiency reconstruction. Specifically, an efficient and memory-saving multi-frame motion utilization strategy is proposed by reusing the intermediate flow of adjacent frames, which considerably reduces the computation burden of frame alignment compared with traditional LSTM-based designs. In addition, the proposed hidden state in BRN is updated by the reused optical flow and refined by the Feature Refinement Module (FRM) for further optimization. Moreover, by utilizing intermediate flow estimation, the proposed method can infer non-linear motion and restore details better. Extensive experiments demonstrate that our optical-flow-reuse-based bidirectional recurrent network (OFR-BRN) is superior to state-of-the-art methods in accuracy and efficiency.
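One way to picture the flow-reuse idea is via standard flow composition, where a longer-range flow is assembled from adjacent flows that have already been computed rather than re-estimated from scratch; treating this as the precise mechanism inside OFR-BRN is an assumption on our part:
$$ f_{t \to t+2}(\mathbf{x}) \;\approx\; f_{t \to t+1}(\mathbf{x}) \;+\; f_{t+1 \to t+2}\bigl(\mathbf{x} + f_{t \to t+1}(\mathbf{x})\bigr), $$
so each newly estimated adjacent flow can be chained onto the flows already in hand, instead of running a fresh alignment for every frame pair.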
Real-world video super-resolution (VSR) presents significant challenges due to complex and unpredictable degradations. Although some recent methods utilize image diffusion models for VSR and have shown improved detail generation capabilities, they still struggle to produce temporally consistent frames. We attempt to use Stable Video Diffusion (SVD) combined with ControlNet to address this issue. However, due to the intrinsic image-animation characteristics of SVD, it is challenging to generate fine details using only low-quality videos. To tackle this problem, we propose DAM-VSR, an appearance and motion disentanglement framework for VSR. This framework disentangles VSR into appearance enhancement and motion control problems. Specifically, appearance enhancement is achieved through reference image super-resolution, while motion control is achieved through video ControlNet. This disentanglement fully leverages the generative prior of video diffusion models and the detail generation capabilities of image super-resolution models. Furthermore, equipped with the proposed motion-aligned bidirectional sampling strategy, DAM-VSR can conduct VSR on longer input videos. DAM-VSR achieves state-of-the-art performance on real-world data and AIGC data, demonstrating its powerful detail generation capabilities.
A large number of image super-resolution algorithms based on sparse coding have been proposed, and some of them realize multi-frame super-resolution. In multi-frame super-resolution based on sparse coding, both accurate image registration and sparse coding are required. Previous studies on multi-frame super-resolution based on sparse coding first apply block matching for image registration, followed by sparse coding to enhance the image resolution. In this paper, these two problems are solved by optimizing a single objective function. The results of numerical experiments support the effectiveness of the proposed approach.
Light field cameras capture the 3D information in a scene with a single exposure. This special feature makes light field cameras very appealing for a variety of applications: from post-capture refocus, to depth estimation and image-based rendering. However, light field cameras suffer by design from strong limitations in their spatial resolution, which should therefore be augmented by computational methods. On the one hand, off-the-shelf single-frame and multi-frame super-resolution algorithms are not ideal for light field data, as they do not consider its particular structure. On the other hand, the few super-resolution algorithms explicitly tailored for light field data exhibit significant limitations, such as the need to estimate an explicit disparity map at each view. In this work we propose a new light field super-resolution algorithm meant to address these limitations. We adopt an approach akin to multi-frame super-resolution, where the complementary information in the different light field views is used to augment the spatial resolution of the whole light field. We show that coupling the multi-frame approach with a graph regularizer that enforces the light field structure via nonlocal self-similarities makes it possible to avoid the costly and challenging disparity estimation step for all the views. Extensive experiments show that the new algorithm compares favorably to the other state-of-the-art methods for light field super-resolution, both in terms of PSNR and visual quality.
This work addresses the Burst Super-Resolution (BurstSR) task using a new architecture, which requires restoring a high-quality image from a sequence of noisy, misaligned, and low-resolution RAW bursts. To overcome the challenges in BurstSR, we propose a Burst Super-Resolution Transformer (BSRT), which can significantly improve the capability of extracting inter-frame information and reconstruction. To achieve this goal, we propose a Pyramid Flow-Guided Deformable Convolution Network (Pyramid FG-DCN) and incorporate Swin Transformer Blocks and Groups as our main backbone. More specifically, we combine optical flows and deformable convolutions, hence our BSRT can handle misalignment and aggregate the potential texture information across multiple frames more efficiently. In addition, our Transformer-based structure can capture long-range dependency to further improve the performance. The evaluation on both synthetic and real-world tracks demonstrates that our approach achieves a new state-of-the-art in the BurstSR task. Further, our BSRT wins the championship in the NTIRE2022 Burst Super-Resolution Challenge.
Camera pipelines receive raw Bayer-format frames that need to be denoised, demosaiced, and often super-resolved. Multiple frames are captured to utilize natural hand tremors and enhance resolution. Multi-frame super-resolution is therefore a fundamental problem in camera pipelines. Existing adversarial methods are constrained by the quality of ground truth. We propose GenMFSR, the first Generative Multi-Frame Raw-to-RGB Super Resolution pipeline, that incorporates image priors from foundation models to obtain sub-pixel information for camera ISP applications. GenMFSR can align multiple raw frames, unlike existing single-frame super-resolution methods, and we propose a loss term that restricts generation to high-frequency regions in the raw domain, thus preventing low-frequency artifacts.
The objective of image super-resolution is to reconstruct a high-resolution (HR) image with the prior knowledge from one or several low-resolution (LR) images. However, in the real world, due to the limited complementary information, the performance of both single-frame and multi-frame super-resolution reconstruction degrades rapidly as the magnification increases. In this paper, we propose a novel two-step image super-resolution method concatenating multi-frame super-resolution (MFSR) with single-frame super-resolution (SFSR), to progressively upsample images to the desired resolution. The proposed method consists of an L0-norm constrained reconstruction scheme and an enhanced residual back-projection network, integrating the flexibility of the variational model-based method and the feature learning capacity of the deep learning-based method. To verify the effectiveness of the proposed algorithm, extensive experiments with both simulated and real-world sequences were conducted. The experimental results show that the proposed method yields superior performance in both objective and perceptual quality measurements. The average PSNRs of the cascade model on Set5 and Set14 are 33.413 dB and 29.658 dB, respectively, which are 0.76 dB and 0.621 dB higher than the baseline method. In addition, the experiments indicate that this cascade model can be robustly applied to different SFSR and MFSR methods.
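As a rough illustration of the kind of objective an L0-norm constrained MFSR reconstruction step typically minimizes (the paper's exact data term, degradation operators, and regularizer are assumptions here):
$$ \hat{x} \;=\; \arg\min_{x} \; \sum_{k=1}^{K} \bigl\| D\,B\,W_{k}\,x - y_{k} \bigr\|_{2}^{2} \;+\; \lambda \,\bigl\| \nabla x \bigr\|_{0}, $$
where $y_k$ are the registered LR observations, $W_k$ warps the latent HR image $x$ toward frame $k$, $B$ and $D$ model blur and downsampling, and the $\ell_0$ penalty on the gradient favors piecewise-smooth reconstructions; in a cascade of this kind, the output of such a step would then be passed to the SFSR network for the second upsampling stage.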
We introduce the notion of point affiliation into feature upsampling. By abstracting a feature map into non-overlapped semantic clusters formed by points of identical semantic meaning, feature upsampling can be viewed as point affiliation -- designating a semantic cluster for each upsampled point. In the framework of kernel-based dynamic upsampling, we show that an upsampled point can resort to its low-res decoder neighbors and high-res encoder point to reason the affiliation, conditioned on the mutual similarity between them. We therefore present a generic formulation for generating similarity-aware upsampling kernels and prove that such kernels encourage not only semantic smoothness but also boundary sharpness. This formulation constitutes a novel, lightweight, and universal upsampling solution, Similarity-Aware Point Affiliation (SAPA). We show its working mechanism via our preliminary designs with window-shape kernel. After probing the limitations of the designs on object detection, we reveal additional insights for upsampling, leading to SAPA with the dynamic kernel shape. Extensive experiments demonstrate that SAPA outperforms prior upsamplers and invites consistent performance improvements on a number of dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, image matting, and depth estimation. Code is made available at: https://github.com/tiny-smart/sapa
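A simplified sketch of the similarity-aware kernel idea, assuming a PyTorch implementation and a 2x upsampling ratio; the function name, dot-product similarity, and window handling below are illustrative choices, not the released SAPA code:

```python
import torch
import torch.nn.functional as F

def similarity_aware_upsample(decoder_lr, encoder_hr, window=3):
    """decoder_lr: (B, C, h, w) low-res decoder feature; encoder_hr: (B, C, 2h, 2w) encoder feature."""
    B, C, H, W = encoder_hr.shape
    # low-res decoder neighbors of each output point, taken from a nearest-neighbor upsampling
    dec_up = F.interpolate(decoder_lr, size=(H, W), mode='nearest')
    neighbors = F.unfold(dec_up, kernel_size=window, padding=window // 2)   # (B, C*k*k, H*W)
    neighbors = neighbors.view(B, C, window * window, H * W)
    # the high-res encoder point acts as the query that decides the affiliation
    query = encoder_hr.view(B, C, 1, H * W)
    sim = (query * neighbors).sum(dim=1) / C ** 0.5                         # (B, k*k, H*W)
    weights = sim.softmax(dim=1)                                            # similarity-aware kernel
    out = (neighbors * weights.unsqueeze(1)).sum(dim=2)                     # (B, C, H*W)
    return out.view(B, C, H, W)
```

Because the kernel is a softmax over the local window, each upsampled value stays within the convex hull of its low-res decoder neighbors while the encoder feature steers it toward the correct semantic cluster at boundaries.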
Most recent works on optical flow use convex upsampling as the last step to obtain high-resolution flow. In this work, we show and discuss several issues and limitations of this currently widely adopted convex upsampling approach. We propose a series of changes, in an attempt to resolve current issues. First, we propose to decouple the weights for the final convex upsampler, making it easier to find the correct convex combination. For the same reason, we also provide extra contextual features to the convex upsampler. Then, we increase the convex mask size by using an attention-based alternative convex upsampler; Transformers for Convex Upsampling. This upsampler is based on the observation that convex upsampling can be reformulated as attention, and we propose to use local attention masks as a drop-in replacement for convex masks to increase the mask size. We provide empirical evidence that a larger mask size increases the likelihood of the existence of the convex combination. Lastly, we propose an alternative training scheme to remove bilinear interpolation artifacts from the model output. Our proposed ideas could theoretically be applied to almost every current state-of-the-art optical flow architecture. On the FlyingChairs + FlyingThings3D training setting we reduce the Sintel Clean training end-point-error of RAFT from 1.42 to 1.26, GMA from 1.31 to 1.18, and that of FlowFormer from 0.94 to 0.90, by solely adapting the convex upsampler.
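For context, the standard convex upsampler that this line of work analyzes and modifies (as used in RAFT-style flow networks) can be sketched as follows; this is the baseline formulation, not the paper's proposed attention-based variant:

```python
import torch
import torch.nn.functional as F

def convex_upsample(flow, mask, factor=8):
    """flow: (B, 2, H, W) coarse flow; mask: (B, 9*factor*factor, H, W) predicted logits."""
    B, _, H, W = flow.shape
    # softmax over the 3x3 neighborhood gives a convex combination per fine-grid pixel
    mask = mask.view(B, 1, 9, factor, factor, H, W).softmax(dim=2)
    neighbors = F.unfold(factor * flow, kernel_size=3, padding=1)    # flow magnitudes scale with resolution
    neighbors = neighbors.view(B, 2, 9, 1, 1, H, W)
    up = (mask * neighbors).sum(dim=2)                               # (B, 2, factor, factor, H, W)
    up = up.permute(0, 1, 4, 2, 5, 3).reshape(B, 2, factor * H, factor * W)
    return up
```

The paper's point is that finding a good convex combination with a fixed 3x3 mask and entangled weights is hard, which motivates decoupled weights, extra context, and larger attention-based masks.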
Video tokenizers are essential for latent video diffusion models, converting raw video data into spatiotemporally compressed latent spaces for efficient training. However, extending state-of-the-art video tokenizers to achieve a temporal compression ratio beyond 4x without increasing channel capacity poses significant challenges. In this work, we propose an alternative approach to enhance temporal compression. We find that the reconstruction quality of temporally subsampled videos from a low-compression encoder surpasses that of high-compression encoders applied to original videos. This indicates that high-compression models can leverage representations from lower-compression models. Building on this insight, we develop a bootstrapped high-temporal-compression model that progressively trains high-compression blocks atop well-trained lower-compression models. Our method includes a cross-level feature-mixing module to retain information from the pretrained low-compression model and guide higher-compression blocks to capture the remaining details from the full video sequence. Evaluation of video benchmarks shows that our method significantly improves reconstruction quality while increasing temporal compression compared to directly training the full model. Furthermore, the resulting compact latent space effectively trains a video diffusion model for high-quality video generation with a significantly reduced token budget.
Applying an image processing algorithm independently to each video frame often leads to temporal inconsistency in the resulting video. To address this issue, we present a novel and general approach for blind video temporal consistency. Our method is only trained on a pair of original and processed videos directly instead of a large dataset. Unlike most previous methods that enforce temporal consistency with optical flow, we show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior (DVP). Moreover, a carefully designed iteratively reweighted training strategy is proposed to address the challenging multimodal inconsistency problem. We demonstrate the effectiveness of our approach on 7 computer vision tasks on videos. Extensive quantitative and perceptual experiments show that our approach obtains superior performance than state-of-the-art methods on blind video temporal consistency. We further extend DVP to video propagation and demonstrate its effectiveness in propagating three different types of information (color, artistic style, and object segmentation). A progressive propagation strategy with pseudo labels is also proposed to enhance DVP's performance on video propagation. Our source codes are publicly available at https://github.com/ChenyangLEI/deep-video-prior.
Video enhancement plays an important role in various video applications. In this paper, we propose a new intra-and-inter-constraint-based video enhancement approach aiming to 1) achieve high intra-frame quality of the entire picture where multiple regions of interest (ROIs) can be adaptively and simultaneously enhanced, and 2) guarantee the inter-frame quality consistencies among video frames. We first analyze features from different ROIs and create a piecewise tone mapping curve for the entire frame such that the intra-frame quality of a frame can be enhanced. We further introduce new inter-frame constraints to improve the temporal quality consistency. Experimental results show that the proposed algorithm clearly outperforms the state-of-the-art algorithms.
Video data exhibits complex temporal dynamics due to various factors such as camera motion, speed variation, and different activities. To effectively capture this diverse motion pattern, this paper presents a new temporal adaptive module ({\bf TAM}) to generate video-specific temporal kernels based on its own feature map. TAM proposes a unique two-level adaptive modeling scheme by decoupling the dynamic kernel into a location-sensitive importance map and a location-invariant aggregation weight. The importance map is learned in a local temporal window to capture short-term information, while the aggregation weight is generated from a global view with a focus on long-term structure. TAM is a modular block and could be integrated into 2D CNNs to yield a powerful video architecture (TANet) with a very small extra computational cost. The extensive experiments on Kinetics-400 and Something-Something datasets demonstrate that our TAM outperforms other temporal modeling methods consistently, and achieves state-of-the-art performance at similar complexity. The code is available at \url{ https://github.com/liu-zhy/temporal-adaptive-module}.
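A simplified PyTorch sketch of the two-level idea, with a local branch producing a location-sensitive importance map and a global branch producing a video-specific temporal kernel shared over locations; the fixed number of input frames and the layer choices are illustrative assumptions, not the released TANet code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalAdaptiveModule(nn.Module):
    """Sketch: local importance map + globally generated dynamic 1D temporal kernel."""
    def __init__(self, channels, num_frames, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        # local branch: location-sensitive importance over a short temporal window
        self.local = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        # global branch: maps each channel's temporal profile to an aggregation kernel
        self.global_fc = nn.Sequential(
            nn.Linear(num_frames, 2 * num_frames), nn.ReLU(inplace=True),
            nn.Linear(2 * num_frames, kernel_size), nn.Softmax(dim=-1),
        )

    def forward(self, x):                                    # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        x = x * torch.sigmoid(self.local(x))                 # rescale by the importance map
        profile = x.mean(dim=(3, 4)).reshape(B * C, T)       # per-(video, channel) temporal profile
        kernel = self.global_fc(profile).view(B * C, 1, self.kernel_size)
        # depthwise temporal convolution with the dynamic, location-invariant kernel
        seq = x.permute(3, 4, 0, 1, 2).reshape(H * W, B * C, T)
        out = F.conv1d(seq, kernel, padding=self.kernel_size // 2, groups=B * C)
        return out.reshape(H, W, B, C, T).permute(2, 3, 4, 0, 1)
```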
Exploiting similar and sharper scene patches in spatio-temporal neighborhoods is critical for video deblurring. However, CNN-based methods show limitations in capturing long-range dependencies and modeling non-local self-similarity. In this paper, we propose a novel framework, Flow-Guided Sparse Transformer (FGST), for video deblurring. In FGST, we customize a self-attention module, Flow-Guided Sparse Window-based Multi-head Self-Attention (FGSW-MSA). For each $query$ element on the blurry reference frame, FGSW-MSA enjoys the guidance of the estimated optical flow to globally sample spatially sparse yet highly related $key$ elements corresponding to the same scene patch in neighboring frames. Besides, we present a Recurrent Embedding (RE) mechanism to transfer information from past frames and strengthen long-range temporal dependencies. Comprehensive experiments demonstrate that our proposed FGST outperforms state-of-the-art (SOTA) methods on both DVD and GOPRO datasets and even yields more visually pleasing results in real video deblurring. Code and pre-trained models are publicly available at https://github.com/linjing7/VR-Baseline
Video denoising refers to the problem of removing "noise" from a video sequence. Here the term "noise" is used in a broad sense to refer to any corruption or outlier or interference that is not the quantity of interest. In this work, we develop a novel approach to video denoising that is based on the idea that many noisy or corrupted videos can be split into three parts: the "low-rank layer", the "sparse layer", and a residual layer (which is small and bounded). We show, using extensive experiments, that our denoising approach outperforms the state-of-the-art denoising algorithms.
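A compact way to write such a three-layer split is the robust-PCA-style decomposition below, with frames stacked as columns of a matrix $M$; the specific norms and solver used by the paper are assumptions on our part:
$$ M = L + S + E, \qquad \min_{L,\,S} \; \|L\|_{*} + \lambda \|S\|_{1} \quad \text{s.t.} \quad \|M - L - S\|_{F} \le \varepsilon, $$
where the nuclear norm $\|L\|_{*}$ promotes a low-rank layer (the largely static background), the $\ell_1$ term captures sparse outliers and corruptions, and $E$ absorbs the small, bounded residual.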
How to properly model the inter-frame relation within the video sequence is an important but unsolved challenge for video restoration (VR). In this work, we propose an unsupervised flow-aligned sequence-to-sequence model (S2SVR) to address this problem. On the one hand, the sequence-to-sequence model, which has proven capable of sequence modeling in the field of natural language processing, is explored for the first time in VR. Optimized serialization modeling shows potential in capturing long-range dependencies among frames. On the other hand, we equip the sequence-to-sequence model with an unsupervised optical flow estimator to maximize its potential. The flow estimator is trained with our proposed unsupervised distillation loss, which can alleviate the data discrepancy and inaccurate degraded optical flow issues of previous flow-based methods. With reliable optical flow, we can establish accurate correspondence among multiple frames, narrowing the domain difference between 1D language and 2D misaligned frames and improving the potential of the sequence-to-sequence model. S2SVR shows superior performance in multiple VR tasks, including video deblurring, video super-resolution, and compressed video quality enhancement. Code and models are publicly available at https://github.com/linjing7/VR-Baseline
Video restoration tasks, including super-resolution, deblurring, etc, are drawing increasing attention in the computer vision community. A challenging benchmark named REDS is released in the NTIRE19 Challenge. This new benchmark challenges existing methods from two aspects: (1) how to align multiple frames given large motions, and (2) how to effectively fuse different frames with diverse motion and blur. In this work, we propose a novel Video Restoration framework with Enhanced Deformable networks, termed EDVR, to address these challenges. First, to handle large motions, we devise a Pyramid, Cascading and Deformable (PCD) alignment module, in which frame alignment is done at the feature level using deformable convolutions in a coarse-to-fine manner. Second, we propose a Temporal and Spatial Attention (TSA) fusion module, in which attention is applied both temporally and spatially, so as to emphasize important features for subsequent restoration. Thanks to these modules, our EDVR wins the championship and outperforms the second place by a large margin in all four tracks of the NTIRE19 video restoration and enhancement challenges. EDVR also demonstrates superior performance to state-of-the-art published methods on video super-resolution and deblurring. The code is available at https://github.com/xinntao/EDVR.
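A single-level sketch of offset-predicted deformable feature alignment, the building block that PCD cascades coarse-to-fine over a feature pyramid; the layer sizes are illustrative and the sketch relies on torchvision's deformable convolution rather than the official EDVR code (which additionally predicts modulation masks):

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformAlign(nn.Module):
    """Aligns a neighboring frame's features to the reference frame at one pyramid level (sketch)."""
    def __init__(self, channels=64):
        super().__init__()
        # offsets are predicted from the concatenated neighbor and reference features
        self.offset_conv = nn.Conv2d(2 * channels, 2 * 3 * 3, 3, padding=1)
        self.dcn = DeformConv2d(channels, channels, 3, padding=1)

    def forward(self, neighbor_feat, ref_feat):
        offset = self.offset_conv(torch.cat([neighbor_feat, ref_feat], dim=1))
        # the neighbor's features are sampled at the deformed locations, i.e. implicitly aligned
        return self.dcn(neighbor_feat, offset)
```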
Non-local patch based methods were until recently state-of-the-art for image denoising but are now outperformed by CNNs. Yet they are still the state-of-the-art for video denoising, as video redundancy is a key factor to attain high denoising performance. The problem is that CNN architectures are hardly compatible with the search for self-similarities. In this work we propose a new and efficient way to feed video self-similarities to a CNN. The non-locality is incorporated into the network via a first non-trainable layer which finds for each patch in the input image its most similar patches in a search region. The central values of these patches are then gathered in a feature vector which is assigned to each image pixel. This information is presented to a CNN which is trained to predict the clean image. We apply the proposed architecture to image and video denoising. For the latter, patches are searched for in a 3D spatio-temporal volume. The proposed architecture achieves state-of-the-art results. To the best of our knowledge, this is the first successful application of a CNN to video denoising.
Motion blur is one of the most common degradation artifacts in dynamic scene photography. This paper reviews the NTIRE 2020 Challenge on Image and Video Deblurring. In this challenge, we present the evaluation results from 3 competition tracks as well as the proposed solutions. Track 1 aims to develop single-image deblurring methods focusing on restoration quality. On Track 2, the image deblurring methods are executed on a mobile platform to find the balance of the running speed and the restoration accuracy. Track 3 targets developing video deblurring methods that exploit the temporal relation between input frames. In each competition, there were 163, 135, and 102 registered participants and in the final testing phase, 9, 4, and 7 teams competed. The winning methods demonstrate the state-of-the-art performance on image and video deblurring tasks.
We consider denoising and deblurring problems for tensors. While images can be discretized as matrices, the analogous procedure for color images or videos leads to a tensor formulation. We extend the classical ROF functional for variational denoising and deblurring to the tensor case by employing multi-dimensional total variation regularization. Furthermore, the resulting minimization problem is calculated by the FISTA method generalized to the tensor case. We provide some numerical experiments by applying the scheme to the denoising, the deblurring, and the recoloring of color images as well as to the deblurring of videos.
In this paper, we propose a learning-based approach for denoising raw videos captured under low lighting conditions. We propose to do this by first explicitly aligning the neighboring frames to the current frame using a convolutional neural network (CNN). We then fuse the registered frames using another CNN to obtain the final denoised frame. To avoid directly aligning the temporally distant frames, we perform the two processes of alignment and fusion in multiple stages. Specifically, at each stage, we perform the denoising process on three consecutive input frames to generate the intermediate denoised frames which are then passed as the input to the next stage. By performing the process in multiple stages, we can effectively utilize the information of neighboring frames without directly aligning the temporally distant frames. We train our multi-stage system using an adversarial loss with a conditional discriminator. Specifically, we condition the discriminator on a soft gradient mask to prevent introducing high-frequency artifacts in smooth regions. We show that our system is able to produce temporally coherent videos with realistic details. Furthermore, we demonstrate through extensive experiments that our approach outperforms state-of-the-art image and video denoising methods both numerically and visually.
Video deblurring is still an unsolved problem due to the challenging spatio-temporal modeling process. Existing convolutional neural network-based methods, however, show a limited capacity for effective spatial and temporal modeling for video deblurring. This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt Transformer for video deblurring. VDTR exploits the superior long-range and relation modeling capabilities of Transformer for both spatial and temporal modeling. However, it is challenging to design an appropriate Transformer-based model for video deblurring due to the complicated non-uniform blurs, misalignment across multiple frames and the high computational costs for high-resolution spatial modeling. To address these problems, VDTR advocates performing attention within non-overlapping windows and exploiting the hierarchical structure for long-range dependencies modeling. For frame-level spatial modeling, we propose an encoder-decoder Transformer that utilizes multi-scale features for deblurring. For multi-frame temporal modeling, we adapt Transformer to fuse multiple spatial features efficiently. Compared with CNN-based methods, the proposed method achieves highly competitive results on both synthetic and real-world video deblurring benchmarks, including DVD, GOPRO, REDS and BSD. We hope such a Transformer-based architecture can serve as a powerful alternative baseline for video deblurring and other video restoration tasks. The source code will be available at \url{https://github.com/ljzycmd/VDTR}.
Due to its high speed and low latency, the dynamic vision sensor (DVS) is frequently employed in motion deblurring. Ideally, high-quality events would adeptly capture intricate motion information. However, real-world events are generally degraded, thereby introducing significant artifacts into the deblurred results. In response to this challenge, we model the degradation of events and propose RDNet to improve the quality of image deblurring. Specifically, we first analyze the mechanisms underlying degradation and simulate paired events based on that. These paired events are then fed into the first stage of the RDNet for training the restoration model. The events restored in this stage serve as a guide for the second-stage deblurring process. To better assess the deblurring performance of different methods on real-world degraded events, we present a new real-world dataset named DavisMCR. This dataset incorporates events with diverse degradation levels, collected by manipulating environmental brightness and target object contrast. Our experiments are conducted on synthetic datasets (GOPRO), real-world datasets (REBlur), and the proposed dataset (DavisMCR). The results demonstrate that RDNet outperforms classical event denoising methods in event restoration. Furthermore, RDNet exhibits better performance in deblurring tasks compared to state-of-the-art methods. DavisMCR is available at https://github.com/Yeeesir/DVS_RDNet.
Video deblurring aims at recovering sharp details from a sequence of blurry frames. Despite the proliferation of depth sensors in mobile phones and the potential of depth information to guide deblurring, depth-aware deblurring has received only limited attention. In this work, we introduce the 'Depth-Aware VIdeo DEblurring' (DAVIDE) dataset to study the impact of depth information in video deblurring. The dataset comprises synchronized blurred, sharp, and depth videos. We investigate how the depth information should be injected into the existing deep RGB video deblurring models, and propose a strong baseline for depth-aware video deblurring. Our findings reveal the significance of depth information in video deblurring and provide insights into the use cases where depth cues are beneficial. In addition, our results demonstrate that while the depth improves deblurring performance, this effect diminishes when models are provided with a longer temporal context. Project page: https://germanftv.github.io/DAVIDE.github.io/ .
Video restoration, which aims to restore clear frames from degraded videos, has numerous important applications. The key to video restoration depends on utilizing inter-frame information. However, existing deep learning methods often rely on complicated network architectures, such as optical flow estimation, deformable convolution, and cross-frame self-attention layers, resulting in high computational costs. In this study, we propose a simple yet effective framework for video restoration. Our approach is based on grouped spatial-temporal shift, which is a lightweight and straightforward technique that can implicitly capture inter-frame correspondences for multi-frame aggregation. By introducing grouped spatial shift, we attain expansive effective receptive fields. Combined with basic 2D convolution, this simple framework can effectively aggregate inter-frame information. Extensive experiments demonstrate that our framework outperforms the previous state-of-the-art method, while using less than a quarter of its computational cost, on both video deblurring and video denoising tasks. These results indicate the potential for our approach to significantly reduce computational overhead while maintaining high-quality results. Code is available at https://github.com/dasongli1/Shift-Net.
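A toy sketch of the grouped spatial-temporal shift idea; the real method uses more groups and shift directions, and the shapes, group split, and use of wrap-around rolling below are illustrative simplifications rather than the released Shift-Net code:

```python
import torch

def grouped_spatiotemporal_shift(feats, shift=4):
    """feats: (B, T, C, H, W); returns shifted features of the same shape."""
    B, T, C, H, W = feats.shape
    out = feats.clone()
    g = C // 4
    # temporal shift: first channel group borrows from the previous frame, second from the next
    out[:, 1:, :g] = feats[:, :-1, :g]
    out[:, :-1, g:2 * g] = feats[:, 1:, g:2 * g]
    # spatial shift: third group moves horizontally, fourth vertically (roll used here for brevity)
    out[:, :, 2 * g:3 * g] = torch.roll(feats[:, :, 2 * g:3 * g], shifts=shift, dims=-1)
    out[:, :, 3 * g:] = torch.roll(feats[:, :, 3 * g:], shifts=shift, dims=-2)
    return out
```

Plain 2D convolutions applied after such a shift see pixels from neighboring frames and displaced spatial positions, which is where the implicit multi-frame aggregation and the enlarged effective receptive field come from.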
Stereo video super-resolution (SVSR) aims to enhance the spatial resolution of the low-resolution video by reconstructing the high-resolution video. The key challenges in SVSR are preserving the stereo-consistency and temporal-consistency, without which viewers may experience 3D fatigue. There are several notable works on stereoscopic image super-resolution, but there is little research on stereo video super-resolution. In this paper, we propose a novel Transformer-based model for SVSR, namely Trans-SVSR. Trans-SVSR comprises two key novel components: a spatio-temporal convolutional self-attention layer and an optical flow-based feed-forward layer that discovers the correlation across different video frames and aligns the features. The parallax attention mechanism (PAM) that uses the cross-view information to consider the significant disparities is used to fuse the stereo views. Due to the lack of a benchmark dataset suitable for the SVSR task, we collected a new stereoscopic video dataset, SVSR-Set, containing 71 full high-definition (HD) stereo videos captured using a professional stereo camera. Extensive experiments on the collected dataset, along with two other datasets, demonstrate that the Trans-SVSR can achieve competitive performance compared to the state-of-the-art methods. Project code and additional results are available at https://github.com/H-deep/Trans-SVSR/
This report organizes research on video enhancement and super-resolution into ten dimensions. The technical roadmap has evolved from classical CNN propagation mechanisms, to Transformer/Mamba long-range modeling, and on to generative reconstruction with diffusion models; application scenarios have expanded from general-purpose enhancement to vertical domains such as satellite remote sensing, face restoration, and Raw-domain multi-frame fusion; on the engineering side, the field balances joint space-time super-resolution, generalization to blind degradations, lightweight deployment, and deep integration with video coding standards. Overall, the trend is a shift from pure pixel reconstruction toward perceptual enhancement, from fixed scale factors toward arbitrary scales, and from laboratory settings toward real-world, complex degradation scenarios.