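Light field video frame interpolation methods (space-time / temporal super-resolution)
Self-supervised light field video reconstruction: from sparse views (stereo/monocular) to spatio-temporal light fields
These works share a self-supervised learning framework for light field video reconstruction and interpolation from sparse-view input. The network is guided mainly by geometric consistency (disparity, epipolar, stereo or monocular geometry), photometric consistency, and temporal constraints, while low-rank or layered light field priors regularize the estimate and improve view interpolation and extrapolation. They also stress the handling of practical difficulties such as disocclusion.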
- SeLFVi: Self-supervised Light-Field Video Reconstruction from Stereo Video(Prasan A. Shedligeri, Florian Schiffers, Sushobhan Ghosh, O. Cossairt, K. Mitra, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))
- Synthesizing Light Field Video from Monocular Video(Shrisudhan Govindarajan, Prasan Shedligeri, Sarah, Kaushik Mitra, 2022, ArXiv Preprint)
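The photometric consistency constraint named above can be illustrated on a single scanline. The sketch below is not taken from either paper: it assumes a rectified stereo pair with the convention `left[x] = right[x - d]`, and the function names `warp_scanline` and `photometric_l1` are hypothetical. It shows only that the L1 photometric error vanishes when the disparity used for warping is correct.

```python
def warp_scanline(src, disparity):
    """Backward-warp a 1D scanline: out[x] = src[x - disparity[x]].

    Uses linear interpolation and clamps sample positions to the valid range.
    """
    n = len(src)
    out = []
    for x in range(n):
        pos = min(max(x - disparity[x], 0.0), n - 1.0)
        i0 = int(pos)
        i1 = min(i0 + 1, n - 1)
        w = pos - i0
        out.append((1.0 - w) * src[i0] + w * src[i1])
    return out


def photometric_l1(target, warped):
    """Mean absolute difference between the observed view and the warped view."""
    return sum(abs(a - b) for a, b in zip(target, warped)) / len(target)


# Toy data (assumed): the left view is the right view shifted by 2 pixels.
right = [0, 0, 1, 2, 3, 0, 0, 0]
left = [0, 0, 0, 0, 1, 2, 3, 0]

loss_correct = photometric_l1(left, warp_scanline(right, [2.0] * 8))
loss_wrong = photometric_l1(left, warp_scanline(right, [0.0] * 8))
print(loss_correct, loss_wrong)
```

In a self-supervised setup this error would be minimized with respect to the predicted disparity, so no ground-truth light field is needed for this term.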
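Light field video temporal interpolation / frame interpolation: angular consistency and motion-driven synthesis
These works follow the temporal interpolation paradigm (VFI / temporal super-resolution): given inputs at two (or adjacent) time instants, generate the light field or video frame at an intermediate time. The methods generally rely on motion, optical flow, or scene flow for temporal alignment and synthesize the intermediate views with a network, while emphasizing consistency (angular, spatial, and temporal) and representational efficiency (for example, explicitly separating motion and appearance features, or reusing attention maps).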
- Angularly Consistent Light Field Video Interpolation(Pierre David, Mikael Le Pendu, C. Guillemot, 2020, 2020 IEEE International Conference on Multimedia and Expo (ICME))
- Light field video frame interpolation using hierarchical spatial-angular-temporal information decoupling and fusion(Mingxing Fu, Yeyao Chen, Chongchong Jin, Haiyong Xu, Ting Luo, Zongju Peng, Gangyi Jiang, 2026, Displays)
- VFIMamba: Video Frame Interpolation with State Space Models(Yutao Cui, Chunxu Liu, Kai Ma, Limin Wang, Guozhen Zhang, Xiaotong Zhao, 2024, Advances in Neural Information Processing Systems 37)
- Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation(Guozhen Zhang, Yuhan Zhu, Hongya Wang, Youxin Chen, Gangshan Wu, Limin Wang, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
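As a rough illustration of flow-based temporal alignment, the sketch below interpolates a 1D "frame" at time t under a linear-motion assumption: both inputs are backward-warped toward time t and blended. This is a generic textbook-style scheme, not the method of any paper listed here; the names `sample` and `interpolate_frame` and the uniform flow value are assumptions made for the example.

```python
def sample(signal, pos):
    """Linearly interpolate signal at a fractional position, clamped to the borders."""
    n = len(signal)
    pos = min(max(pos, 0.0), n - 1.0)
    i0 = int(pos)
    i1 = min(i0 + 1, n - 1)
    w = pos - i0
    return (1.0 - w) * signal[i0] + w * signal[i1]


def interpolate_frame(frame0, frame1, flow01, t):
    """Synthesize the frame at time t in (0, 1) from two frames and the flow 0 -> 1.

    Assumes linear motion, so a pixel seen at x at time t came from x - t*f in
    frame0 and will arrive at x + (1 - t)*f in frame1.
    """
    out = []
    for x in range(len(frame0)):
        f = flow01[x]
        from0 = sample(frame0, x - t * f)
        from1 = sample(frame1, x + (1.0 - t) * f)
        out.append((1.0 - t) * from0 + t * from1)
    return out


# Toy data (assumed): a small pattern that moves 4 pixels to the right.
frame0 = [0, 0, 1, 2, 3, 0, 0, 0, 0, 0]
frame1 = [0, 0, 0, 0, 0, 0, 1, 2, 3, 0]
mid = interpolate_frame(frame0, frame1, [4.0] * 10, 0.5)
print(mid)
```

A plain average of the two frames would instead show two ghosted copies of the pattern, which is why alignment precedes synthesis in these methods.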
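Light field geometry and optical flow estimation for interpolation: centered on motion/depth propagation
These works introduce motion and geometry tied to the spatio-temporal light field (optical flow, depth, view propagation) as the key intermediate representation for interpolation or reconstruction. The first uses learned spatio-temporal flow estimation to propagate LF angular information to a 2D video and then synthesizes a high-frame-rate light field video through appearance estimation. The second builds on 4D light field geometry (superpixels and slanted planes) to estimate full-view optical flow and depth, which provides physical constraints for temporal and cross-view consistency.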
- Light field video capture using a learning-based hybrid imaging system(Ting-Chun Wang, Jun-Yan Zhu, N. Kalantari, Alexei A. Efros, R. Ramamoorthi, 2017, ACM Transactions on Graphics)
- Full View Optical Flow Estimation Leveraged From Light Field Superpixel(Hao Zhu, Xiaoming Sun, Qi Zhang, Qing Wang, A. Robles-Kelly, Hongdong Li, Shaodi You, 2020, IEEE Transactions on Computational Imaging)
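One-stage / end-to-end space-time super-resolution: joint spatial-angular-temporal enhancement
These works formulate the task as joint space-time super-resolution (slow-motion space-time SR), recovering a high-quality spatio-temporal sequence directly from low-resolution, low-frame-rate observations. One targets LF video and handles the spatial, angular, and temporal dimensions together through SAI re-organization and multi-scale, angular-assisted feature aggregation. The other merges spatial super-resolution and temporal interpolation into a one-stage framework, using feature temporal interpolation to avoid the two-stage overhead of explicitly reconstructing intermediate frames.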
- Space-Time Super-Resolution for Light Field Videos(Zeyu Xiao, Zhen Cheng, Zhiwei Xiong, 2023, IEEE Transactions on Image Processing)
- Zooming SlowMo: An Efficient One-Stage Framework for Space-Time Video Super-Resolution(Xiaoyu Xiang, Yapeng Tian, Yulun Zhang, Yun Fu, Jan P. Allebach, Chenliang Xu, 2021, ArXiv Preprint)
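Continuous-time light field video factorization: POV-driven temporally consistent reconstruction
This work emphasizes continuous-time perception (the persistence-of-vision, or POV, effect and the display refresh rate) rather than constraints on discrete frames alone: it applies sequence-level low-rank factorization and builds the POV effect explicitly into a global objective to remove flicker and improve temporal consistency. It also proposes practical, efficient sequence-level implementations (cuboid-wise factorization and the causal TF-C) suited to system deployment.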
- Temporal Fusion: Continuous-Time Light Field Video Factorization(Li-De Chen, Li-Qun Weng, Hao-Chien Cheng, An-Yu Cheng, Chao-Tsung Huang, 2025, IEEE Transactions on Image Processing)
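To make "low-rank factorization" concrete, here is a minimal sketch of nonnegative matrix factorization with the classic multiplicative updates, in pure Python. It is a generic stand-in, not the temporal fusion (TF) algorithm from the paper: it has no POV term and no sequence-level processing, and the initialization and the helper names (`matmul`, `nmf`, `frob_error`) are assumptions for illustration.

```python
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def frob_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iters=200, eps=1e-12):
    """Approximate a nonnegative matrix V by W @ H with nonnegative factors."""
    m, n = len(v), len(v[0])
    # Deterministic, strictly positive initialization (an arbitrary choice).
    w = [[1.0 + 0.1 * ((i + 2 * k) % 3) for k in range(rank)] for i in range(m)]
    h = [[1.0 + 0.1 * ((j + k) % 2) for j in range(n)] for k in range(rank)]
    for _ in range(iters):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(m)]
    return w, h


# Toy target (assumed): an exactly rank-1 nonnegative matrix.
v = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
w, h = nmf(v, rank=1)
print(frob_error(v, w, h))
```

The multiplicative form keeps both factors nonnegative, which matters when the factors are meant to represent physical quantities such as display layer transmittances.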
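General video temporal modeling: short-/long-term differences, fusion strategies, and recurrent aggregation
These works develop general VSR/VFI ideas for video temporal modeling and temporal feature aggregation, using short-term and long-term temporal differences, different fusion schemes (2D/3D CNN, RNN), or a sliding-window recurrent state to improve temporal consistency and detail recovery. Although light fields are not their specific target, they offer transferable temporal modeling modules and ideas for light field temporal interpolation.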
- Local-Global Temporal Difference Learning for Satellite Video Super-Resolution(Yi Xiao, Qiangqiang Yuan, Kui Jiang, Xianyu Jin, Jiang He, Liangpei Zhang, Chia-Wen Lin, 2023, ArXiv Preprint)
- Revisiting Temporal Modeling for Video Super-resolution(Takashi Isobe, Fang Zhu, Xu Jia, Shengjin Wang, 2020, ArXiv Preprint)
- Sliding Window Recurrent Network for Efficient Video Super-Resolution(Wenyi Lian, Wenjing Lian, 2022, ArXiv Preprint)
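Two of the ingredients mentioned above, adjacent-frame differences and a three-frame sliding window with a recurrently updated hidden state, can be sketched without any learning. The code below is a toy, not an implementation of any listed network: border replication and the hand-written blending rule in `recurrent_pass` are assumptions that stand in for learned modules.

```python
def temporal_differences(frames):
    """Per-pixel differences between adjacent frames (a short-term motion cue)."""
    return [[b - a for a, b in zip(prev, cur)] for prev, cur in zip(frames, frames[1:])]


def sliding_windows(frames):
    """Three-frame windows centered on each frame; borders are replicated (assumed)."""
    padded = [frames[0]] + list(frames) + [frames[-1]]
    return [tuple(padded[i:i + 3]) for i in range(len(frames))]


def recurrent_pass(frames, keep=0.5):
    """Toy recurrence: the hidden state blends its past value with the window mean.

    A learned network would replace this hand-written update.
    """
    hidden = [0.0] * len(frames[0])
    outputs = []
    for window in sliding_windows(frames):
        mean = [sum(px) / 3.0 for px in zip(*window)]
        hidden = [keep * h + (1.0 - keep) * m for h, m in zip(hidden, mean)]
        outputs.append(hidden)
    return outputs


# Toy sequence (assumed): three frames of two pixels each.
frames = [[1, 1], [2, 3], [2, 5]]
print(temporal_differences(frames))
print(sliding_windows(frames)[0])
print(recurrent_pass(frames))
```

The hidden state is what lets information from earlier windows reach later frames without widening the window itself.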
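Multi-view light field data, correspondence, and scene flow infrastructure: support for alignment and evaluation
These works provide multi-view correspondence, scene flow estimation, and data pipelines that support research on light field video interpolation: on-line correspondence and scene flow estimation supply the basis for cross-view geometric alignment, while the dataset and its capture/processing pipeline target sparse, wide-baseline multi-view video LF and lower the engineering barrier to training and evaluating methods.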
- Efficient Multi‐image Correspondences for On‐line Light Field Video Processing(Lukasz Dabala, M. Ziegler, P. Didyk, Frederik Zilly, J. Keinert, K. Myszkowski, H. Seidel, Przemyslaw Rokita, Tobias Ritschel, 2016, Computer Graphics Forum)
- Multi-view Scene Flow Estimation: A View Centered Variational Approach(Tali Basha, Y. Moses, N. Kiryati, 2010, International Journal of Computer Vision)
- Dataset and Pipeline for Multi-view Light-Field Video(Neus Sabater, Guillaume Boisson, B. Vandame, P. Kerbiriou, Frederic Babon, Matthieu Hog, R. Gendrot, T. Langlois, Olivier Bureller, Arno Schubert, V. Allié, 2017, 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
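Generative and robust modeling for temporal quality: motion priors, generalized degradation, and time-varying kernel consistency
These works tie the quality of interpolation or super-resolution output directly to the modeling of temporal motion or degradation. The diffusion model uses motion priors to improve the physical and perceptual consistency of generated intermediate frames; NegVSR improves generalization to real scenes through broader noise-degradation modeling; and blind video SR exploits the temporal consistency of degradation kernels to avoid the temporal artifacts caused by assuming a fixed kernel. These ideas can be used to improve the robustness and temporal stability of light field temporal super-resolution.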
- Motion-aware Latent Diffusion Models for Video Frame Interpolation(Zhilin Huang, Yijie Yu, Ling Yang, C. Qin, Bing Zheng, Xiawu Zheng, Zikun Zhou, Yaowei Wang, Wenming Yang, 2024, Proceedings of the 32nd ACM International Conference on Multimedia)
- NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-World Video Super-Resolution(Yexing Song, Meilin Wang, Zhijing Yang, Xiaoyu Xian, Yukai Shi, 2023, ArXiv Preprint)
- Temporal Kernel Consistency for Blind Video Super-Resolution(Lichuan Xiang, Royson Lee, Mohamed S. Abdelfattah, Nicholas D. Lane, Hongkai Wen, 2021, ArXiv Preprint)
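Survey of learning-based light field imaging: task landscape and trend background
This entry is a survey: it systematically reviews learning-based light field imaging together with the related frameworks, evaluation methods, and datasets, and provides the overall background for placing light field video interpolation methods within the wider picture (task taxonomy, trends in learning frameworks, and open challenges).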
- Learning-based light field imaging: an overview(Saeed Mahmoudpour, Carla Pagliari, P. Schelkens, 2024, EURASIP Journal on Image and Video Processing)
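This body of literature can be organized into nine logical groups: (1) self-supervised reconstruction of light field video from sparse views; (2) light field video temporal interpolation (VFI); (3) motion-driven methods that use light field geometry and optical flow; (4) joint spatial-angular-temporal space-time super-resolution (end-to-end / one-stage); (5) continuous-time, POV-driven factorization for temporal consistency; (6) general video temporal modeling modules; (7) multi-view alignment, scene flow, and data pipeline support; (8) generative and robust modeling (motion priors, degradation generalization, time-varying kernel consistency); and (9) survey background. Together they cover the key links in the research chain of light field video interpolation.
21 related papers in total.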
… suffering from low frame rates when providing high spatial-angular sampling. To address the above challenges, this paper proposes a new LFV frame interpolation method based on …
In this paper, we address the problem of temporal interpolation of sparsely sampled video light fields using dense scene flows. Given light fields at two time instants, the goal is to interpolate an intermediate light field to form a spatially, angularly and temporally coherent light field video sequence. We first compute angularly coherent bidirectional scene flows between the two input light fields. We then use the optical flows and the two light fields as inputs to a convolutional neural network that synthesizes independently the views of the light field at an intermediate time. In order to measure the angular consistency of a light field, we propose a new metric based on epipolar geometry. Experimental results show that the proposed method produces light fields that are angularly coherent while keeping similar temporal and spatial consistency as state-of-the-art video frame interpolation methods.
Light field cameras have many advantages over traditional cameras, as they allow the user to change various camera settings after capture. However, capturing light fields requires a huge bandwidth to record the data: a modern light field camera can only take three images per second. This prevents current consumer light field cameras from capturing light field videos. Temporal interpolation at such extreme scale (10x, from 3 fps to 30 fps) is infeasible as too much information will be entirely missing between adjacent frames. Instead, we develop a hybrid imaging system, adding another standard video camera to capture the temporal information. Given a 3 fps light field sequence and a standard 30 fps 2D video, our system can then generate a full light field video at 30 fps. We adopt a learning-based approach, which can be decomposed into two steps: spatio-temporal flow estimation and appearance estimation. The flow estimation propagates the angular information from the light field sequence to the 2D video, so we can warp input images to the target view. The appearance estimation then combines these warped images to output the final pixels. The whole process is trained end-to-end using convolutional neural networks. Experimental results demonstrate that our algorithm outperforms current video interpolation methods, enabling consumer light field videography, and making applications such as refocusing and parallax view generation achievable on videos for the first time.
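The frame-rate arithmetic in this abstract (3 fps light field keyframes, 30 fps target) can be spelled out as follows. The helper name `intermediate_times` is hypothetical; it only computes where the frames to be synthesized fall between two consecutive keyframes, as normalized times in (0, 1).

```python
from fractions import Fraction


def intermediate_times(keyframe_fps, target_fps):
    """Normalized times of the frames that lie strictly between two keyframes."""
    factor = Fraction(target_fps, keyframe_fps)
    if factor.denominator != 1:
        raise ValueError("target rate must be an integer multiple of the keyframe rate")
    n = factor.numerator
    return [Fraction(k, n) for k in range(1, n)]


times = intermediate_times(3, 30)
print(len(times), times[0], times[-1])
```

With a 10x rate gap, nine light field frames must be produced between every pair of captured ones, which is the case the abstract describes as infeasible for pure temporal interpolation and addresses with the auxiliary 2D video.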
Light field (LF) cameras suffer from a fundamental trade-off between spatial and angular resolutions. Additionally, due to the significant amount of data that needs to be recorded, the Lytro ILLUM, a modern LF camera, can only capture three frames per second. In this paper, we consider space-time super-resolution (SR) for LF videos, aiming at generating high-resolution and high-frame-rate LF videos from low-resolution and low-frame-rate observations. Extending existing space-time video SR methods to this task directly will meet two key challenges: 1) how to re-organize sub-aperture images (SAIs) efficiently and effectively given highly redundant LF videos, and 2) how to aggregate complementary information between multiple SAIs and frames considering the coherence in LF videos. To address the above challenges, we propose a novel framework for space-time super-resolving LF videos for the first time. First, we propose a novel Multi-Scale Dilated SAI Re-organization strategy for re-organizing SAIs into auxiliary view stacks with decreasing resolution as the Chebyshev distance in the angular dimension increases. In particular, the auxiliary view stack with original resolution preserves essential visual details, while the down-scaled view stacks capture long-range contextual information. Second, we propose the Multi-Scale Aggregated Feature extractor and the Angular-Assisted Feature Interpolation module to utilize and aggregate information from the spatial, angular, and temporal dimensions in LF videos. The former aggregates similar contents from different SAIs and frames for subsequent reconstruction in a disparity-free manner at the feature level, whereas the latter interpolates intermediate frames temporally by implicitly aggregating geometric information. 
Compared to other potential approaches, experimental results demonstrate that the reconstructed LF videos generated by our framework achieve higher reconstruction quality and better preserve the LF parallax structure and temporal consistency. The implementation code is available at https://github.com/zeyuxiao1997/LFSTVSR.
A factored display emits full-parallax dense-view light fields for a glasses-free 3D experience without sacrificing the spatial resolution of a liquid-crystal display (LCD). For static light fields, it achieves high-quality reconstruction by applying frame-based low-rank factorization to time-multiplexed sub-frame contents of stacked LCDs. However, for light field videos such frame-based factorization could introduce reconstruction artifacts and visual flickers and further cause human discomfort. The artifacts mainly come from incomplete constraints for the emitted light fields that are actually perceived in continuous time, instead of discrete frames. In particular, the perceived light fields are related to the persistence-of-vision (POV) effect of human eyes and the refresh rates of LCD displays, which is not well explored in previous work. In this work, we introduce a light-field video factorization framework—temporal fusion (TF)—to resolve these issues. To begin with, we explicitly formulate the continuous-time POV effect into a global factorization objective functional to eliminate visual flickers and enhance image quality. We further show that this optimization problem can be solved by sequence-level iterative updates on LCD sub-frames. Then, to tackle the enormous requirement of memory access for the sequence-level processing flow, we devise an efficient cuboid-wise factorization algorithm which enables practical GPU implementation. We also devise another lightweight causal framework, TF-C, for supporting low-latency applications. Finally, extensive experiments are performed to verify the effectiveness. Compared to the plain frame-based factorization, TF/TF-C can improve temporal consistency by reducing flicker values by 85%/91% and enhance reconstruction quality by increasing PSNR values by 5.0dB/3.7dB. 
In addition, we present a prototype dual-layer factored display, which was built with two 240-Hz high-refresh-rate LCDs, to demonstrate the visual quality for real-life applications.
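For readers who want to translate the reported PSNR gains into error terms: under the standard definition PSNR = 10 log10(peak^2 / MSE), a gain of g dB corresponds to the mean squared error shrinking by a factor of 10^(g/10). The sketch below only evaluates that identity; the function names and the sample MSE value are assumptions, and no numbers beyond the 5.0 dB and 3.7 dB gains come from the abstract.

```python
import math


def psnr(mse, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak * peak / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE must shrink to raise PSNR by gain_db decibels."""
    return 10.0 ** (gain_db / 10.0)


for gain in (5.0, 3.7):
    print(gain, "dB gain -> MSE reduced by a factor of", round(mse_ratio_for_gain(gain), 2))

# Consistency check on an arbitrary baseline MSE (assumed value).
base_mse = 0.01
improved = psnr(base_mse / mse_ratio_for_gain(5.0)) - psnr(base_mse)
```

So a 5.0 dB gain means roughly a threefold reduction in mean squared error relative to the frame-based baseline.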
… Inter-frame modeling is pivotal in generating intermediate frames for video frame interpolation (VFI)… Finally, we use the extracted inter-frame features to estimate motion and generate the …
With the advancement of AIGC, video frame interpolation (VFI) has become a crucial component in existing video generation frameworks, attracting widespread research interest. For the VFI task, the motion estimation between neighboring frames plays a crucial role in avoiding motion ambiguity. However, existing VFI methods always struggle to accurately predict the motion information between consecutive frames, and this imprecise estimation leads to blurred and visually incoherent interpolated frames. In this paper, we propose a novel diffusion framework, Motion-Aware latent Diffusion models (MADiff), which is specifically designed for the VFI task. By incorporating motion priors between the conditional neighboring frames with the target interpolated frame predicted throughout the diffusion sampling procedure, MADiff progressively refines the intermediate outcomes, culminating in generating both visually smooth and realistic results. Extensive experiments conducted on benchmark datasets demonstrate that our method achieves state-of-the-art performance significantly outperforming existing approaches, especially under challenging scenarios involving dynamic textures with complex motion.
Effectively extracting inter-frame motion and appearance information is important for video frame interpolation (VFI). Previous works either extract both types of information in a mixed way or devise separate modules for each type of information, which lead to representation ambiguity and low efficiency. In this paper, we propose a new module to explicitly extract motion and appearance information via a unified operation. Specifically, we rethink the information process in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction. Furthermore, for efficient VFI, our proposed module could be seamlessly integrated into a hybrid CNN and Transformer architecture. This hybrid pipeline can alleviate the computational complexity of inter-frame attention as well as preserve detailed low-level structure information. Experimental results demonstrate that, for both fixed- and arbitrary-timestep interpolation, our method achieves state-of-the-art performance on various datasets. Meanwhile, our approach enjoys a lighter computation overhead over models with close performance. The source code and models are available at https://github.com/MCG-NJU/EMA-VFI.
Conventional photography can only provide a two-dimensional image of the scene, whereas emerging imaging modalities such as light field enable the representation of higher dimensional visual information by capturing light rays from different directions. Light fields provide immersive experiences, a sense of presence in the scene, and can enhance different vision tasks. Hence, research into light field processing methods has become increasingly popular. It does, however, come at the cost of higher data volume and computational complexity. With the growing deployment of machine-learning and deep architectures in image processing applications, a paradigm shift toward learning-based approaches has also been observed in the design of light field processing methods. Various learning-based approaches are developed to process the high volume of light field data efficiently for different vision tasks while improving performance. Taking into account the diversity of light field vision tasks and the deployed learning-based frameworks, it is necessary to survey the scattered learning-based works in the domain to gain insight into the current trends and challenges. This paper aims to review the existing learning-based solutions for light field imaging and to summarize the most promising frameworks. Moreover, evaluation methods and available light field datasets are highlighted. Lastly, the review concludes with a brief outlook for future research directions.
… for Light-Field video. The proposed algorithms are specially tailored to process sparse and widebaseline multi-view videos … In this paper we focus on camera arrays as a video LightField …
In this paper, we present a full view optical flow estimation method for plenoptic imaging. Our method employs the structure delivered by the four-dimensional light field over multiple views making use of superpixels. These superpixels are four dimensional in nature and can be used to represent the objects in the scene as a set of slanted-planes in three-dimensional space so as to recover a piecewise rigid depth estimate. Taking advantage of these superpixels and the corresponding slanted planes, we recover the optical flow and depth maps by using a two-step optimization scheme where the flow is propagated from the central view to the other views in the imagery. We illustrate the utility of our method for depth and flow estimation making use of a dataset of synthetically generated image sequences and real-world imagery captured using a Lytro Illum camera. We also compare our results with those yielded by a number of alternatives elsewhere in the literature.
… based on multi-resolution optical flow [BWS05, YWA10, LYT11… -resolution processing to multi-view data has been proposed … multi-view depth reconstruction method for video light fields. …
… In these studies regularization is imposed on the disparity and optical flow (2D formulation), while our assumptions refer directly to the 3D unknowns. Nor were these methods extended …
Light-field imaging is appealing to the mobile devices market because of its capability for intuitive post-capture processing. Acquiring light field (LF) data with high angular, spatial and temporal resolution poses significant challenges, especially with space constraints preventing bulky optics. At the same time, stereo video capture, now available on many consumer devices, can be interpreted as a sparse LF-capture. We explore the application of small baseline stereo videos for reconstructing high fidelity LF videos. We propose a self-supervised learning-based algorithm for LF video reconstruction from stereo video. The self-supervised LF video reconstruction is guided via the geometric information from the individual stereo pairs and the temporal information from the video sequence. LF estimation is further regularized by a low-rank constraint based on layered LF displays. The proposed self-supervised algorithm facilitates advantages such as post-training finetuning on test sequences and variable angular view interpolation and extrapolation. Quantitatively the reconstructed LF videos show higher fidelity than previously proposed unsupervised approaches. We demonstrate our results via LF videos generated from publicly available stereo videos acquired from commercially available stereoscopic cameras. Finally, we demonstrate that our reconstructed LF videos allow applications such as post-capture focus control and region-of-interest (RoI) based focus tracking for videos.
The hardware challenges associated with light-field (LF) imaging have made it difficult for consumers to access its benefits like applications in post-capture focus and aperture control. Learning-based techniques which solve the ill-posed problem of LF reconstruction from sparse (1, 2 or 4) views have significantly reduced the requirement for complex hardware. LF video reconstruction from sparse views poses a special challenge as acquiring ground-truth for training these models is hard. Hence, we propose a self-supervised learning-based algorithm for LF video reconstruction from monocular videos. We use self-supervised geometric, photometric and temporal consistency constraints inspired by a recent self-supervised technique for LF video reconstruction from stereo video. Additionally, we propose three key techniques that are relevant to our monocular video input. We propose an explicit disocclusion handling technique that encourages the network to inpaint disoccluded regions in an LF frame, using information from adjacent input temporal frames. This is crucial for a self-supervised technique as a single input frame does not contain any information about the disoccluded regions. We also propose an adaptive low-rank representation that provides a significant boost in performance by tailoring the representation to each input scene. Finally, we also propose a novel refinement block that is able to exploit the available LF image data using supervised learning to further refine the reconstruction quality. Our qualitative and quantitative analysis demonstrates the significance of each of the proposed building blocks and also the superior results compared to previous state-of-the-art monocular LF reconstruction techniques. We further validate our algorithm by reconstructing LF videos from monocular videos acquired using a commercial GoPro camera.
In this paper, we address the space-time video super-resolution, which aims at generating a high-resolution (HR) slow-motion video from a low-resolution (LR) and low frame rate (LFR) video sequence. A naïve method is to decompose it into two sub-tasks: video frame interpolation (VFI) and video super-resolution (VSR). Nevertheless, temporal interpolation and spatial upscaling are intra-related in this problem. Two-stage approaches cannot fully make use of this natural property. Besides, state-of-the-art VFI or VSR deep networks usually have a large frame reconstruction module in order to obtain high-quality photo-realistic video frames, which makes the two-stage approaches have large models and thus be relatively time-consuming. To overcome the issues, we present a one-stage space-time video super-resolution framework, which can directly reconstruct an HR slow-motion video sequence from an input LR and LFR video. Instead of reconstructing missing LR intermediate frames as VFI models do, we temporally interpolate LR frame features of the missing LR frames capturing local temporal contexts by a feature temporal interpolation module. Extensive experiments on widely used benchmarks demonstrate that the proposed framework not only achieves better qualitative and quantitative performance on both clean and noisy LR frames but also is several times faster than recent state-of-the-art two-stage networks. The source code is released in https://github.com/Mukosame/Zooming-Slow-Mo-CVPR-2020 .
The capability of video super-resolution (VSR) to synthesize high-resolution (HR) video from ideal datasets has been demonstrated in many works. However, applying the VSR model to real-world video with unknown and complex degradation remains a challenging task. First, existing degradation metrics in most VSR methods are not able to effectively simulate real-world noise and blur. On the contrary, simple combinations of classical degradation are used for real-world noise modeling, which led to the VSR model often being violated by out-of-distribution noise. Second, many SR models focus on noise simulation and transfer. Nevertheless, the sampled noise is monotonous and limited. To address the aforementioned problems, we propose a Negatives augmentation strategy for generalized noise modeling in Video Super-Resolution (NegVSR) task. Specifically, we first propose sequential noise generation toward real-world data to extract practical noise sequences. Then, the degeneration domain is widely expanded by negative augmentation to build up various yet challenging real-world noise sets. We further propose the augmented negative guidance loss to learn robust features among augmented negatives effectively. Extensive experiments on real-world datasets (e.g., VideoLQ and FLIR) show that our method outperforms state-of-the-art methods with clear margins, especially in visual quality. Project page is available at: https://negvsr.github.io/.
Video super-resolution plays an important role in surveillance video analysis and ultra-high-definition video display, which has drawn much attention in both the research and industrial communities. Although many deep learning-based VSR methods have been proposed, it is hard to directly compare these methods since the different loss functions and training datasets have a significant impact on the super-resolution results. In this work, we carefully study and compare three temporal modeling methods (2D CNN with early fusion, 3D CNN with slow fusion and Recurrent Neural Network) for video super-resolution. We also propose a novel Recurrent Residual Network (RRN) for efficient video super-resolution, where residual learning is utilized to stabilize the training of RNN and meanwhile to boost the super-resolution performance. Extensive experiments show that the proposed RRN is highly computationally efficient and produces temporally consistent VSR results with finer details than other temporal modeling methods. Besides, the proposed method achieves state-of-the-art results on several widely used benchmarks.
Video super-resolution (VSR) is the task of restoring high-resolution frames from a sequence of low-resolution inputs. Different from single image super-resolution, VSR can utilize frames' temporal information to reconstruct results with more details. Recently, with the rapid development of convolutional neural networks (CNN), the VSR task has drawn increasing attention and many CNN-based methods have achieved remarkable results. However, only a few VSR approaches can be applied to real-world mobile devices due to the computational resources and runtime limitations. In this paper, we propose a Sliding Window based Recurrent Network (SWRN) which can perform real-time inference while still achieving superior performance. Specifically, we notice that video frames should have both spatial and temporal relations that can help to recover details, and the key point is how to extract and aggregate information. To address this, we input three neighboring frames and utilize a hidden state to recurrently store and update the important temporal information. Our experiment on the REDS dataset shows that the proposed method can be well adapted to mobile devices and produce visually pleasant results.
Deep learning-based blind super-resolution (SR) methods have recently achieved unprecedented performance in upscaling frames with unknown degradation. These models are able to accurately estimate the unknown downscaling kernel from a given low-resolution (LR) image in order to leverage the kernel during restoration. Although these approaches have largely been successful, they are predominantly image-based and therefore do not exploit the temporal properties of the kernels across multiple video frames. In this paper, we investigated the temporal properties of the kernels and highlighted its importance in the task of blind video super-resolution. Specifically, we measured the kernel temporal consistency of real-world videos and illustrated how the estimated kernels might change per frame in videos of varying dynamicity of the scene and its objects. With this new insight, we revisited previous popular video SR approaches, and showed that previous assumptions of using a fixed kernel throughout the restoration process can lead to visual artifacts when upscaling real-world videos. In order to counteract this, we tailored existing single-image and video SR techniques to leverage kernel consistency during both kernel estimation and video upscaling processes. Extensive experiments on synthetic and real-world videos show substantial restoration gains quantitatively and qualitatively, achieving the new state-of-the-art in blind video SR and underlining the potential of exploiting kernel temporal consistency.
Optical-flow-based and kernel-based approaches have been extensively explored for temporal compensation in satellite Video Super-Resolution (VSR). However, these techniques are less generalized in large-scale or complex scenarios, especially in satellite videos. In this paper, we propose to exploit the well-defined temporal difference for efficient and effective temporal compensation. To fully utilize the local and global temporal information within frames, we systematically modeled the short-term and long-term temporal discrepancies since we observed that these discrepancies offer distinct and mutually complementary properties. Specifically, we devise a Short-term Temporal Difference Module (S-TDM) to extract local motion representations from RGB difference maps between adjacent frames, which yields more clues for accurate texture representation. To explore the global dependency in the entire frame sequence, a Long-term Temporal Difference Module (L-TDM) is proposed, where the differences between forward and backward segments are incorporated and activated to guide the modulation of the temporal feature, leading to a holistic global compensation. Moreover, we further propose a Difference Compensation Unit (DCU) to enrich the interaction between the spatial distribution of the target frame and temporal compensated results, which helps maintain spatial consistency while refining the features to avoid misalignment. Rigorous objective and subjective evaluations conducted across five mainstream video satellites demonstrate that our method performs favorably against state-of-the-art approaches. Code will be available at https://github.com/XY-boy/LGTD