Temporal Action Localization
Weakly-Supervised Temporal Action Localization (WSTAL) and Feature Mining
This group of works focuses on localization when only video-level labels are available. Research emphases include multiple-instance learning (MIL), contrastive learning, background suppression, pseudo-label generation, and strengthening discriminability through attention mechanisms and feature modeling (e.g., diffusion networks, embedding modeling). This is currently the mainstream direction for reducing annotation cost; a minimal MIL sketch is given after the reference list below.
- Weakly-supervised action localization via embedding-modeling iterative optimization(Xiaoyu Zhang, Haichao Shi, Changsheng Li, Peng Li, Zekun Li, Peng-Shan Ren, 2021, Pattern Recognit.)
- Learning Proposal-Aware Re-Ranking for Weakly-Supervised Temporal Action Localization(Yufan Hu, Jie Fu, Mengyuan Chen, Junyu Gao, Jianfeng Dong, Bin Fan, Hongmin Liu, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Video Complicated-Information Extraction and Filtering Network for Weakly-Supervised Temporal Action Localization(Jiaxuan Li, Tiancheng Ma, Xiaohui Yang, Lijun Yang, Chen Zheng, 2025, IEEE Signal Processing Letters)
- Context Sensitive Network for weakly-supervised fine-grained temporal action localization(Cerui Dong, Qinying Liu, Zilei Wang, Y. Zhang, Fengjun Zhao, 2025, Neural networks : the official journal of the International Neural Network Society)
- Snippet-Inter Difference Attention Network for Weakly-Supervised Temporal Action Localization(Wei Zhou, Kang Lin, Weipeng Hu, Chao Xie, Tao Su, Haifeng Hu, Yap-Peng Tan, 2025, IEEE Transactions on Multimedia)
- Weakly supervised graph learning for action recognition in untrimmed video(Xiao Yao, Jia Zhang, Ruixuan Chen, Dan Zhang, Yifeng Zeng, 2022, The Visual Computer)
- Feature Weakening, Contextualization, and Discrimination for Weakly Supervised Temporal Action Localization(Md. Moniruzzaman, Zhaozheng Yin, 2024, IEEE Transactions on Multimedia)
- Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization(Songchun Zhang, Chunhui Zhao, 2023, IEEE Transactions on Circuits and Systems for Video Technology)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization(Bo He, Xitong Yang, Le Kang, Zhiyu Cheng, Xingfa Zhou, Abhinav Shrivastava, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- SAPS: Self-Attentive Pathway Search for weakly-supervised action localization with background-action augmentation(Xiaoyu Zhang, Yaru Zhang, Haichao Shi, Jing Dong, 2021, Comput. Vis. Image Underst.)
- Temporal Dropout for Weakly Supervised Action Localization(Chi Xie, Zikun Zhuang, Shengjie Zhao, Shuang Liang, 2022, ACM Transactions on Multimedia Computing, Communications and Applications)
- Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization(Jingjing Li, Tianyu Yang, Wei Ji, Jue Wang, Li Cheng, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Entropy guided attention network for weakly-supervised action localization(Yi Cheng, Ying Sun, Hehe Fan, Tao Zhuo, J. Lim, Mohan Kankanhalli, 2022, Pattern Recognit.)
- Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction(Quan Zhang, Yuxin Qi, Xi Tang, Rui Yuan, Xi Lin, Ke Zhang, Chun Yuan, 2025, No journal)
- Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling(Guiqin Wang, Penghui Zhao, Cong Zhao, Shusen Yang, Jie Cheng, Luziwei Leng, Jianxing Liao, Qinghai Guo, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- A Collaborative Hierarchical Aggregation Network for Weakly Supervised Temporal Action Localization(Zan Gao, Xiaoyi Xu, Yibo Zhao, Chunjie Ma, Yanbing Xue, Riwei Wang, 2025, ACM Transactions on Multimedia Computing, Communications and Applications)
- Weakly Supervised Temporal Action Localization With Contrastive Learning-Based Action Salience Network(Jingtao Sun, Weipeng Shi, Shaoyang Hao, Jialin Wang, 2025, The European Journal on Artificial Intelligence)
- Action-to-Action Diffusion Network for Weakly Supervised Temporal Action Localization(Yuanbing Zou, Qingjie Zhao, Prodip Kumar Sarker, Le Yang, Binglu Wang, 2025, IEEE Transactions on Multimedia)
- Text-Video Knowledge Guided Prompting for Weakly Supervised Temporal Action Localization(Yuxiang Shao, Feifei Zhang, Changsheng Xu, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- Decoupled spatial-temporal predicting model for weakly supervised action localization(Guiqin Wang, Peng Zhao, Xiang Wang, Xin An, Qian Zhang, Shusen Yang, Qinghai Guo, 2026, Knowl. Based Syst.)
- Weakly-supervised Action Localization via Hierarchical Mining(Jialuo Feng, Fa-Ting Hong, Jiachen Du, Zhongang Qi, Ying Shan, Xiaohu Qie, Weihao Zheng, Jianping Wu, 2022, ArXiv)
- FCSC: Weakly-Supervised Temporal Action Localization via Feature Calibration-assisted Sequence Comparison(Ling Zhang, 2025, Journal of Computer Science and Frontier Technologies)
- GCLNet: Generalized Contrastive Learning for Weakly Supervised Temporal Action Localization(Jing Wang, Dehui Kong, Baocai Yin, 2025, IEEE Transactions on Big Data)
- Cross-Task Relation-Aware Consistency for Weakly Supervised Temporal Action Detection(Wenfei Yang, Huan Ren, Tianzhu Zhang, Zhe Zhang, Yongdong Zhang, Feng Wu, 2025, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization(Ziqiang Li, Yongxin Ge, Jiaruo Yu, Zhongming Chen, 2022, Proceedings of the 30th ACM International Conference on Multimedia)
- A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization(Ashraful Islam, Chengjiang Long, R. Radke, 2021, ArXiv)
- Ensemble Prototype Network For Weakly Supervised Temporal Action Localization(Kewei Wu, Wenjie Luo, Zhao Xie, Dan Guo, Zhao Zhang, Richang Hong, 2024, IEEE Transactions on Neural Networks and Learning Systems)
- Convex Combination Consistency between Neighbors for Weakly-supervised Action Localization(Qinying Liu, Zilei Wang, Ruoxi Chen, Zhilin Li, 2022, ArXiv)
- GCRNet: Global Context Relation Network for Weakly-Supervised Temporal Action Localization: Identify the target actions in a long untrimmed video and find the corresponding action start point and end point.(Yiguan Liao, Changzhen Qiu, Zhiyong Zhang, Luping Wang, Liang Wang, 2021, Proceedings of the 2021 5th International Conference on Video and Image Processing)
- CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning(Can Zhang, Meng Cao, Dongming Yang, Jie Chen, Yuexian Zou, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- TS-WTAL: A Two-Stage Framework for Weakly Supervised Temporal Action Localization(Shanzhen Lan, Shujun Wang, Wanting Wei, Yang Wang, 2025, 2025 International Conference on Culture-Oriented Science & Technology (CoST))
- Similar Modality Enhancement and Action Consistency Learning for Weakly Supervised Temporal Action Localization(Maodong Li, Chao Zheng, Jian Wang, Bing Li, 2025, No journal)
- Weakly-Supervised Action Localization by Hierarchical Attention Mechanism with Multi-Scale Fusion Strategies(Yu Wang, Sheng Zhao, 2024, 2024 IEEE International Conference on Multimedia and Expo (ICME))
- A Snippets Relation and Hard-Snippets Mask Network for Weakly-Supervised Temporal Action Localization(Yibo Zhao, Hua Zhang, Zan Gao, Weili Guan, Meng Wang, Shenyong Chen, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer(Ziyi Liu, Yangcen Liu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Temporal and Semantic Correlation Network for Weakly-Supervised Temporal Action Localization(Kang Lin, Wei Zhou, Zhijie Zheng, Dihu Chen, Tao Su, 2025, ACM Transactions on Multimedia Computing, Communications and Applications)
- Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization(Yu Wang, Sheng Zhao, Shiwei Chen, 2024, IEEE Transactions on Multimedia)
- Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization(Xijun Wang, A. Katsaggelos, 2023, ArXiv)
- Weakly supervised temporal action localization via a multimodal feature map diffusion process(Yuanbing Zou, Qingjie Zhao, Shanshan Li, 2025, Eng. Appl. Artif. Intell.)
- Global context-aware attention model for weakly-supervised temporal action localization(Weina Fu, Wenxiang Zhang, Jing Long, Gautam Srivastava, Shuai Liu, 2025, Alexandria Engineering Journal)
- Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization(Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi, 2024, Proceedings of the 32nd ACM International Conference on Multimedia)
- Vectorized Evidential Learning for Weakly-Supervised Temporal Action Localization(Junyu Gao, Mengyuan Chen, Changsheng Xu, 2023, IEEE Transactions on Pattern Analysis and Machine Intelligence)
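Most of the WSTAL papers above share one core mechanism: a snippet-level class activation sequence (CAS) pooled into a video-level score so that only video-level labels are needed. The following is a minimal, illustrative sketch of that top-k MIL objective in PyTorch; the feature dimension, k ratio, class count, and the frozen I3D backbone mentioned in the comments are assumptions for illustration, not any specific paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILHead(nn.Module):
    """Minimal multiple-instance-learning head for WSTAL (illustrative sketch).

    Snippet features -> class activation sequence (CAS); the video-level score
    per class is the mean of its top-k snippet scores, so only a video-level
    label is required for the classification loss.
    """
    def __init__(self, feat_dim: int, num_classes: int, k_ratio: float = 0.125):
        super().__init__()
        self.classifier = nn.Conv1d(feat_dim, num_classes, kernel_size=1)
        self.k_ratio = k_ratio

    def forward(self, feats: torch.Tensor):
        # feats: (B, T, D) snippet features from a frozen backbone (e.g., I3D)
        cas = self.classifier(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, C)
        k = max(1, int(cas.shape[1] * self.k_ratio))
        topk_scores, _ = cas.topk(k, dim=1)            # (B, k, C)
        video_logits = topk_scores.mean(dim=1)         # (B, C)
        return cas, video_logits

# Usage: multi-label video-level BCE loss, no frame-level annotation needed.
head = MILHead(feat_dim=2048, num_classes=20)
feats = torch.randn(2, 750, 2048)                      # 2 videos, 750 snippets each
labels = torch.zeros(2, 20); labels[0, 3] = 1; labels[1, 7] = 1
cas, logits = head(feats)
loss = F.binary_cross_entropy_with_logits(logits, labels)
```

At inference, the CAS itself is thresholded and grouped into proposals, which is where the contrastive, background-suppression, and pseudo-label techniques listed above intervene.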
Architectural Evolution: Transformers, Mamba, and Diffusion Models
This group reflects the shift of TAL backbones from CNNs to more advanced architectures. It covers Transformer variants that use self-attention to capture long-range dependencies, efficient state-space models (Mamba) for very long videos, and generative proposal methods built on diffusion models, aiming to improve feature representation and end-to-end detection performance. A sketch of a Transformer-based snippet encoder follows the reference list below.
- TriDet: Temporal Action Detection with Relative Boundary Modeling(Ding Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, Dacheng Tao, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- DSPA: Dual-Spiral Pyramid Network with Multi-scale Attention for Temporal Action Localization(Xuhong Li, Shuai Zhang, Haiyu Liu, Hexiong Yang, Keyan Ren, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- Transformer-Based Temporal Feature Pyramid Network for Temporal Action Proposal Generation(Yan Zhang, Tian Xiao, Lu Zhi, Feibi Lv, Jiajia Zhu, Liang Liu, Zhaoning Wang, Zixiang Di, Lexi Xu, Bei Li, 2025, 2025 IEEE International Conference on High Performance Computing and Communications (HPCC))
- DroFormer: temporal action detection with drop mechanism of attention(Xuejiao Lee, Chaoqun Hong, Xuebai Zhang, Yongfeng Chen, 2025, International Journal of Machine Learning and Cybernetics)
- TBT-Former: Learning Temporal Boundary Distributions for Action Localization(Thisara Rathnayaka, Uthayasanker Thayasivam, 2025, ArXiv)
- ReAct: Temporal Action Detection with Relational Queries(Ding Shi, Yujie Zhong, Qiong Cao, Jing Zhang, Lin Ma, Jia Li, Dacheng Tao, 2022, ArXiv)
- End-to-End Temporal Action Detection With Transformer(Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, S. Bai, X. Bai, 2021, IEEE Transactions on Image Processing)
- Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization(Hayat Ullah, Arslan Munir, Oliver A. Nina, 2025, ArXiv)
- TALLFormer: Temporal Action Localization with Long-memory Transformer(Feng Cheng, Gedas Bertasius, 2022, ArXiv)
- Global-aware Pyramid Network with Boundary Adjustment for Anchor-free Temporal Action Detection(Zhuyuan Liang, Pengjun Zhai, Dulei Zheng, Yu Fang, 2022, Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System)
- Feature matters: Revisiting channel attention for Temporal Action Detection(Guo Chen, Yin-Dong Zheng, Wei Zhu, Jiahao Wang, Tong Lu, 2025, Pattern Recognit.)
- Prediction-Feedback DETR for Temporal Action Detection(Jihwan Kim, Miso Lee, Cheol-Ho Cho, Jihyun Lee, Jae-pil Heo, 2024, No journal)
- Faster-TAD: Towards Temporal Action Detection with Proposal Generation and Classification in a Unified Network(Shimin Chen, Chen Chen, Wei Li, Xunqiang Tao, Yan Guo, 2022, ArXiv)
- Dual DETRs for Multi-Label Temporal Action Detection(Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, Limin Wang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization(Jinglin Xu, Yaqi Zhang, Wenhao Zhou, Hongmin Liu, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion(Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- IACFormer: a transformer framework with instantaneous average convolution for temporal action detection(Haiping Zhang, Dongyang Xu, Haixiang Lin, Dongjing Wang, Dongjin Yu, L. Guan, Wanjun Zhang, 2025, Applied Intelligence)
- ActionFormer: Localizing Moments of Actions with Transformers(Chen-Lin Zhang, Jianxin Wu, Yin Li, 2022, ArXiv)
- An Adaptive Dual Selective Transformer for Temporal Action Localization(Qiang Li, Guang Zu, Hui Xu, Jun Kong, Yanni Zhang, Jianzhong Wang, 2024, IEEE Transactions on Multimedia)
- Temporal Action Proposal Generation with Transformers(Lining Wang, Haosen Yang, Wenhao Wu, H. Yao, Hujie Huang, 2021, ArXiv)
- Multi-scale interaction transformer for temporal action proposal generation(Jiahui Shang, Ping Wei, Huan Li, Nanning Zheng, 2022, Image Vis. Comput.)
- LGAFormer: transformer with local and global attention for action detection(Haiping Zhang, Fuxing Zhou, Dongjing Wang, Xinhao Zhang, Dongjin Yu, L. Guan, 2024, The Journal of Supercomputing)
- Relaxed Transformer Decoders for Direct Action Proposal Generation(Jing Tan, Jiaqi Tang, Limin Wang, Gangshan Wu, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))
- KeyMamba: keyframe-enhanced state space model for efficient temporal action detection(Zikai Chen, Dan Wei, Peixing Li, Xiaolan Wang, 2025, Journal of Electronic Imaging)
- MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection(Hui Lu, Yi Yu, Shijian Lu, Deepu Rajan, Boon Poh Ng, A. Kot, Xudong Jiang, 2025, ArXiv)
- Local Temporal Mamba for Temporal Action Detection(Di Cui, Qi Zhang, 2025, 2025 6th International Conference on Computers and Artificial Intelligence Technology (CAIT))
- Transformer or Mamba for Temporal Action Localization? Insights from a Comprehensive Experimental Comparison Study(Zejian Zhang, Cristina Palmero, Sergio Escalera, 2025, No journal)
- Modeling long-term video semantic distribution for temporal action proposal generation(Tingting Han, Sicheng Zhao, Xiaoshuai Sun, Jun Yu, 2021, Neurocomputing)
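To make the encoder shift concrete, here is a rough sketch of the pattern used by ActionFormer-style detectors listed above: self-attention over snippet features plus strided pooling to build a temporal feature pyramid. The layer count, pooling scheme, and dimensions are assumptions for illustration only, not any paper's exact configuration.

```python
import torch
import torch.nn as nn

class SnippetTransformerEncoder(nn.Module):
    """Sketch of a Transformer temporal encoder with a feature pyramid.

    Self-attention captures long-range snippet dependencies; strided max-pooling
    between layers halves the temporal resolution, giving multi-scale features
    that a detection head can decode into per-snippet classes and boundary offsets.
    """
    def __init__(self, dim: int = 512, heads: int = 8, levels: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(levels))
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor):
        # x: (B, T, D) projected snippet features
        pyramid = []
        for block in self.blocks:
            x = block(x)                                       # global self-attention
            pyramid.append(x)                                  # keep this resolution
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # halve T for next level
        return pyramid                                         # [(B, T, D), (B, T/2, D), ...]

feats = torch.randn(2, 256, 512)
levels = SnippetTransformerEncoder()(feats)
print([f.shape for f in levels])
```

Mamba-based variants replace the attention block with a state-space layer to avoid quadratic cost on very long sequences, while keeping the same pyramid-and-head structure.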
Action Boundary Refinement, Proposal Generation, and Confidence Optimization
These works target a core difficulty of action localization: boundary ambiguity. Techniques such as boundary denoising, phase consistency, high-resolution modeling, uncertainty estimation, and confidence calibration are used to improve proposal quality and localization accuracy (IoU). A sketch of tIoU-based proposal rescoring follows the reference list below.
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization(Chuming Lin, C. Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Yanwei Fu, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Action Category and Phase Consistency Regularization for High-Quality Temporal Action Proposal Generation(Yushu Liu, Weigang Zhang, Guorong Li, Qingming Huang, 2021, 2021 IEEE International Conference on Multimedia and Expo (ICME))
- MBGNet:Multi-branch boundary generation network with temporal context aggregation for temporal action detection(Xiaoying Pan, Nijuan Zhang, Hewei Xie, Shoukun Li, Tong Feng, 2024, Applied Intelligence)
- Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection(Chen Yaosen, Bing Guo, Yan Shen, Wei Wang, Weichen Lu, Xinhua Suo, 2022, IEEE Transactions on Circuits and Systems for Video Technology)
- SARNet: Self-attention Assisted Ranking Network for Temporal Action Proposal Generation(Jiahao Yu, Hong Jiang, 2021, 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- FDDet: Frequency-Decoupling for Boundary Refinement in Temporal Action Detection(Xinnan Zhu, Yichen Zhu, Tixin Chen, Wentao Wu, Yuanjie Dang, 2025, ArXiv)
- Boundary-Denoising for Video Activity Localization(Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel P'erez-R'ua, Bernard Ghanem, 2023, ArXiv)
- Phase-Sensitive Model for Temporal Action Proposal Generation(Shijie Sun, Qingsong Zhao, Ziliang Ren, Lei Wang, Jun Cheng, 2021, 2020 IEEE International Conference on E-health Networking, Application & Services (HEALTHCOM))
- Temporal Action Proposal Generation With Action Frequency Adaptive Network(Yepeng Tang, Weining Wang, Chunjie Zhang, J. Liu, Yao Zhao, 2024, IEEE Transactions on Multimedia)
- Centerness-Aware Network for Temporal Action Proposal(Yuan Liu, Jingyuan Chen, Xinpeng Chen, Bing Deng, Jianqiang Huang, Xiansheng Hua, 2022, IEEE Transactions on Circuits and Systems for Video Technology)
- Boundary graph convolutional network for temporal action detection(Chen Yaosen, Bing Guo, Yan Shen, W. Wang, Weichen Lu, Xinhua Suo, 2021, Image Vis. Comput.)
- A Malleable Boundary Network for temporal action detection(Tian Wang, Boyao Hou, Zexian Li, Z. Li, Lei Huang, Baochang Zhang, H. Snoussi, 2022, Comput. Electr. Eng.)
- Boundary-Aware Proposal Generation Method for Temporal Action Localization(Hao Zhang, Chunyan Feng, Jiahui Yang, Zheng Li, Caili Guo, 2023, 2023 IEEE International Conference on Visual Communications and Image Processing (VCIP))
- RefineTAD: Learning Proposal-free Refinement for Temporal Action Detection(Yue Feng, Zhengye Zhang, Rong Quan, Limin Wang, Jie Qin, 2023, Proceedings of the 31st ACM International Conference on Multimedia)
- Refining Action Boundaries for One-stage Detection(Hanyuan Wang, M. Mirmehdi, D. Damen, Toby Perrett, 2022, 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS))
- Boundary Adjusted Network Based on Cosine Similarity for Temporal Action Proposal Generation(Jingye Zheng, Dihu Chen, Haifeng Hu, 2021, Neural Processing Letters)
- Internal Location Assistance for Temporal Action Proposal Generation(Songsong Feng, Shengye Yan, 2024, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Attention-guided Boundary Refinement on Anchor-free Temporal Action Detection(Henglin Shi, Haoyu Chen, Guoying Zhao, 2023, No journal)
- BACNet: Boundary-Anchor Complementary Network for Temporal Action Detection(Zixuan Zhao, Dongqi Wang, Xu Zhao, 2022, 2022 IEEE International Conference on Multimedia and Expo (ICME))
- Advancing Temporal Action Localization with a Boundary Awareness Network(Jialiang Gu, Yang Yi, Min Wang, 2024, Electronics)
- Multi-Level Content-Aware Boundary Detection for Temporal Action Proposal Generation(Taiyi Su, Hanli Wang, Lei Wang, 2023, IEEE Transactions on Image Processing)
- Boundary-Recovering Network for Temporal Action Detection(Jihwan Kim, Jaehyun Choi, Yerim Jeon, Jae-pil Heo, 2024, ArXiv)
- BRTAL: Boundary Refinement Temporal Action Localization via Offset-Driven Diffusion Models(Hongmin Liu, Xueli Li, Bin Fan, Jinglin Xu, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- Context-BMN for Temporal Action Proposal Generation(Baoqing Tang, Shengye Yan, Yihua Ni, Yongjia Yang, Kang Pan, 2021, No journal)
- Temporal Action Proposal Generation with Background Constraint(Haosen Yang, Wenhao Wu, Lining Wang, Sheng Jin, Boyang Xia, H. Yao, Hujie Huang, 2021, No journal)
- PUNet: Temporal Action Proposal Generation With Positive Unlabeled Learning Using Key Frame Annotations(Noor ul Sehr Zia, O. Kayhan, J. V. Gemert, 2021, 2021 IEEE International Conference on Image Processing (ICIP))
- Anchor-Free Action Proposal Network with Uncertainty Estimation(Selen Pehlivan, J. Laaksonen, 2023, 2023 IEEE International Conference on Multimedia and Expo (ICME))
- Class-wise boundary regression by uncertainty in temporal action detection(Y. Chen, Mengjuan Chen, Qingyi Gu, 2022, IET Image Process.)
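Many of the confidence-oriented methods above ultimately feed into temporal IoU (tIoU) computation and score-aware suppression of overlapping proposals. The sketch below shows tIoU and a Gaussian Soft-NMS pass, a common post-processing step in this literature; the decay form, sigma, and thresholds are illustrative choices, not taken from any cited paper.

```python
import numpy as np

def temporal_iou(seg, segs):
    """tIoU between one segment [start, end] and an array of segments (N, 2)."""
    inter = np.clip(np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]), 0, None)
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-6)

def soft_nms(proposals, scores, sigma=0.5, thresh=1e-3):
    """Gaussian Soft-NMS: decay (rather than drop) scores of overlapping proposals."""
    proposals, scores = proposals.copy(), scores.copy()
    keep = []
    while scores.max() > thresh:
        i = scores.argmax()
        keep.append((proposals[i].copy(), float(scores[i])))
        ious = temporal_iou(proposals[i], proposals)
        scores = scores * np.exp(-(ious ** 2) / sigma)   # soft decay of overlapping scores
        scores[i] = 0.0                                  # the picked proposal is consumed
    return keep

props = np.array([[10.0, 25.0], [11.0, 26.0], [40.0, 55.0]])
confs = np.array([0.9, 0.8, 0.7])
print(soft_nms(props, confs))
```

Boundary-refinement methods differ mainly in how the segments fed into this step are produced (regressed offsets, denoised boundaries, uncertainty-weighted scores), but the evaluation target is the same tIoU quantity.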
Open-Set Localization, Zero-Shot Learning, and Multimodal Fusion
These works study how to handle action categories unseen during training (open-vocabulary / zero-shot), typically by transferring knowledge from vision-language foundation models such as CLIP. The group also covers methods that fuse audio, textual descriptions, and image segmentation to enrich action semantics. A CLIP-based zero-shot classification sketch follows the reference list below.
- Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models(Chaolei Han, Hongsong Wang, Jidong Kuang, Lei Zhang, Jie Gui, 2025, ArXiv)
- Concept-Guided Open-Vocabulary Temporal Action Detection(Song Wang, Rui Han, Wei Feng, 2025, Journal of Computer Science and Technology)
- Hierarchical Global–Local Fusion for One-stage Open-vocabulary Temporal Action Detection(Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide, 2025, ACM Transactions on Multimedia Computing, Communications and Applications)
- Test-Time Zero-Shot Temporal Action Localization(Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Zero-Shot Temporal Action Detection via Vision-Language Prompting(Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang, 2022, ArXiv)
- Toward Causal and Evidential Open-Set Temporal Action Detection(Zhuoyao Wang, Ruiwei Zhao, Rui Feng, Cheng Jin, 2025, IEEE Access)
- MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization(Zhenying Fang, Richang Hong, 2025, ArXiv)
- Zero-Shot Temporal Action Detection by Learning Multimodal Prompts and Text-Enhanced Actionness(Asif Raza, Bang Yang, Yuexian Zou, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection(Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma, 2024, ArXiv)
- EAV-Mamba: Efficient Audio-Visual Representation Learning for Weakly-Supervised Temporal Action Localization(Quan Zhang, Jinwei Fang, Yuxin Qi, Mingyang Wan, Guojun Ma, Ke Zhang, Chun Yuan, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization(Fa-Ting Hong, Jialuo Feng, Dan Xu, Ying Shan, Weishi Zheng, 2021, Proceedings of the 29th ACM International Conference on Multimedia)
- CG-SMFNet: Consensus-Guided Selective Multimodal Fusion for Weakly Supervised Temporal Action Localization(Peng Liu, Zitai Jiang, 2025, 2025 IEEE International Workshop on Multimedia Signal Processing (MMSP))
- CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization(Ruiqi Xia, Dan Jiang, Quan Zhang, Ke Zhang, Chun Yuan, 2025, ArXiv)
- A Multi-Modal Transformer Network for Action Detection(Matthew Korban, S. Acton, Peter A. Youngs, 2023, Pattern Recognit.)
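The zero-shot pipelines above generally score each snippet against text embeddings of the category names. The sketch below illustrates that step with the openai `clip` package (assumed installed via `pip install git+https://github.com/openai/CLIP.git`); the action vocabulary, prompt template, and dummy frame tensors are hypothetical, and real systems additionally gate these similarities with a class-agnostic actionness score before grouping snippets into segments.

```python
import torch
import clip  # https://github.com/openai/CLIP, assumed available

device = "cpu"  # keep float32 weights so the dummy tensors below work as-is
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical unseen-action vocabulary with a simple prompt template.
actions = ["high jump", "playing violin", "washing dishes"]
text = clip.tokenize([f"a video of a person {a}" for a in actions]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # snippet_frames: (T, 3, 224, 224) one representative frame per snippet (dummy here)
    snippet_frames = torch.randn(8, 3, 224, 224).to(device)
    vis_emb = model.encode_image(snippet_frames)
    vis_emb = vis_emb / vis_emb.norm(dim=-1, keepdim=True)

    sim = (100.0 * vis_emb @ text_emb.T).softmax(dim=-1)   # (T, num_actions)

print(sim.argmax(dim=-1))  # most likely open-vocabulary label per snippet
```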
Low-Cost Annotation: Semi-Supervised and Point-Level Supervision
To address the high cost of full supervision, these works explore point-level supervision, semi-supervised learning, and data-programming frameworks. Through self-supervised pre-training and consistency constraints, they maintain competitive localization performance with very few annotations. A point-to-segment pseudo-labeling sketch follows the reference list below.
- Action-Agnostic Point-Level Supervision for Temporal Action Detection(Shuhei M. Yoshida, Takashi Shibata, Makoto Terao, Takayuki Okatani, Masashi Sugiyama, 2024, ArXiv)
- Boosting Point-Supervised Temporal Action Localization through Integrating Query Reformation and Optimal Transport(Mengnan Liu, Le Wang, Sanpin Zhou, Kun Xia, Xiaolong Sun, Gang Hua, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- SQL-Net: Semantic Query Learning for Point-Supervised Temporal Action Localization(Yu Wang, Sheng Zhao, Shiwei Chen, 2025, IEEE Transactions on Multimedia)
- Learning Reliable Dense Pseudo-Labels for Point-Level Weakly-Supervised Action Localization(Yuanjie Dang, G. Zheng, Peng Chen, Nan Gao, Ruohong Huan, Dongdong Zhao, Ronghua Liang, 2024, Neural Processing Letters)
- Semi-Supervised Temporal Action Proposal Generation via Exploiting 2-D Proposal Map(Weining Wang, Tianwei Lin, Dongliang He, Fu Li, Shilei Wen, Liang Wang, Jing Liu, 2021, IEEE Transactions on Multimedia)
- Pseudo label refining for semi-supervised temporal action localization(Lingwen Meng, Guobang Ban, Guanghui Xi, Siqi Guo, 2025, PLOS ONE)
- Self-Supervised Learning for Semi-Supervised Temporal Action Proposal(Xiang Wang, Shiwei Zhang, Zhiwu Qing, Yuanjie Shao, Changxin Gao, N. Sang, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization(Yuchen He, Jianbing Lv, Liqi Cheng, Lingyu Meng, Dazhen Deng, Yingcai Wu, 2025, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- SSPT‐Tr: Self‐Supervised Pre‐Training Transformer Based on Triplet for Temporal Action Detection(Qiongmin Zhang, Zeyuan Deng, Bingyi Ran, Shuqiu Tan, Xin Feng, 2025, IEEJ Transactions on Electrical and Electronic Engineering)
- Temporal action proposal generation with self-supervised pre-training transformer(Pan Pan, Xinyu Feng, Li Geng, 2023, No journal)
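A recurring ingredient in the point-supervised methods above is turning a single annotated frame per action into a dense pseudo segment. The function below shows one common heuristic for that expansion (grow the segment while the class score stays above a fraction of the score at the annotated point); the threshold ratio is an assumption, and the cited papers use more elaborate, learned variants of this idea.

```python
import numpy as np

def expand_point_to_segment(cas, point_idx, cls, thresh_ratio=0.5):
    """Grow a pseudo action segment around a single annotated snippet.

    cas:        (T, C) snippet-level class activation scores
    point_idx:  index of the annotated snippet inside the action
    cls:        annotated action class
    The segment extends left/right while the class score stays above a
    fraction of the score at the annotated point.
    """
    scores = cas[:, cls]
    thresh = thresh_ratio * scores[point_idx]
    start = point_idx
    while start > 0 and scores[start - 1] >= thresh:
        start -= 1
    end = point_idx
    while end < len(scores) - 1 and scores[end + 1] >= thresh:
        end += 1
    return start, end  # inclusive snippet indices of the pseudo segment

cas = np.concatenate([np.full(10, 0.1), np.full(15, 0.8), np.full(10, 0.1)])[:, None]
print(expand_point_to_segment(cas, point_idx=17, cls=0))  # -> (10, 24)
```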
Online Detection, Efficient Computation, and Industrial Robustness
These works target practical deployment: online action detection (OAD) for streaming video, model compression, efficient end-to-end adaptation (e.g., LoRA / LoSA fine-tuning), and robustness in vertical domains such as power grids, healthcare, and sports as well as under noisy data. A minimal streaming-detection sketch follows the reference list below.
- Online Action Detection Incorporating an Additional Action Classifier(Min-Hang Hsu, Chen-Chien Hsu, Yin-Tien Wang, Shao-Kang Huang, Yi-Hsing Chien, 2024, Electronics)
- HCM: Online Action Detection With Hard Video Clip Mining(Siyu Liu, Jian Cheng, Ziying Xia, Zhilong Xi, Qin Hou, Zhicheng Dong, 2024, IEEE Transactions on Multimedia)
- Streamer temporal action detection in live video by co-attention boundary matching(Chenhao Li, Chenghai He, Hui Zhang, Jiacheng Yao, J. Zhang, L. Zhuo, 2022, International Journal of Machine Learning and Cybernetics)
- Text-driven online action detection(Manuel Benavent-Lledo, David Mulero-P'erez, David Ortiz-Pérez, José García Rodríguez, 2025, Integrated Computer-Aided Engineering)
- Temporal Action Detection Model Compression by Progressive Block Drop(Xiaoyong Chen, Yong Guo, Jiaming Liang, Sitong Zhuang, Runhao Zeng, Xiping Hu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- DyLoRA-TAD: Dynamic Low-Rank Adapter for End-to-End Temporal Action Detection(Jixin Wu, Mingtao Zhou, Di Wu, Wenqi Ren, Jiatian Mei, Shu Zhang, 2025, Computers, Materials & Continua)
- LoSA: Long-Short-Range Adapter for Scaling End-to-End Temporal Action Localization(Akshita Gupta, Gaurav Mittal, Ahmed Magooda, Ye Yu, Graham W. Taylor, Mei Chen, 2024, 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))
- Robust Temporal Action Localization With Meta Boundary Refinement(Jiahua Li, Kun-Juan Wei, Zhe Xu, Liejun Wang, Cheng Deng, 2025, IEEE Transactions on Multimedia)
- Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions(Runhao Zeng, Xiaoyong Chen, Jiaming Liang, Huisi Wu, Guangzhong Cao, Yong Guo, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Energy vs. Noise: Towards Robust Temporal Action Localization in Open-World(Chenyu Mu, Jiahua Li, Kun-Juan Wei, Cheng Deng, 2025, No journal)
- TAL4Tennis: Temporal Action Localization in Tennis Videos Using State Space Models(Ahmed Jouini, Mohammed Ali Lajnef, Faten Chaieb, A. Loth, 2025, No journal)
- Towards Real-World Power Grid Scenarios: Video Action Detection with Cross-scale Selective Context Aggregation(Lingwen Meng, Siwu Yu, Shasha Luo, Anjun Li, 2025, Inf. Technol. Control.)
- Relative Boundary Modeling: A High-Resolution Cricket Bowl Release Detection Framework with I3D Features(Jun Yu, Leilei Wang, Renjie Lu, Shuoping Yang, Renda Li, Lei Wang, Minchuan Chen, Qingying Zhu, Shaojun Wang, Jing Xiao, 2023, Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports)
- Opentad: a Unified Framework and Comprehensive Study of Temporal Action Detection(Shuming Liu, Chen Zhao, Fatimah Zohra, Mattia Soldan, A. Pardo, Mengmeng Xu, Lama Alssum, Merey Ramazanova, Juan Le'on Alc'azar, A. Cioppa, Silvio Giancola, Carlos Hinojosa, Bernard Ghanem, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
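The online-detection setting above differs from offline TAL in that only past context is available at each instant. The sketch below is a simplified recurrent baseline for that setting (a bounded buffer of past snippet features plus a GRU); the feature dimension, buffer length, and class count are assumptions, and the cited OAD papers use stronger temporal models and training objectives than this.

```python
from collections import deque
import torch
import torch.nn as nn

class OnlineActionDetector(nn.Module):
    """Sketch of streaming online action detection (OAD).

    Keeps a fixed-length buffer of past snippet features and, at every new
    snippet, predicts the class of the current instant from past context only,
    since no future frames exist in the online setting.
    """
    def __init__(self, feat_dim=2048, hidden=512, num_classes=21, buffer_len=64):
        super().__init__()
        self.buffer = deque(maxlen=buffer_len)
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_classes)   # class 0 = background

    @torch.no_grad()
    def step(self, snippet_feat: torch.Tensor):
        # snippet_feat: (feat_dim,) feature of the newly arrived snippet
        self.buffer.append(snippet_feat)
        seq = torch.stack(list(self.buffer)).unsqueeze(0)   # (1, L, D)
        _, h = self.gru(seq)
        return self.cls(h[-1]).softmax(dim=-1)              # (1, num_classes)

det = OnlineActionDetector()
for _ in range(5):                        # simulate a live stream of snippets
    probs = det.step(torch.randn(2048))
print(probs.argmax(dim=-1))
```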
This report consolidates six core research directions in Temporal Action Localization (TAL). The overall trends are: backbone architectures are moving from convolutional networks to Transformers and Mamba (SSM) models capable of handling long temporal dependencies; supervision paradigms are evolving from full supervision, which relies heavily on frame-level annotation, toward weak supervision, point-level supervision, and open-set / zero-shot learning to ease the annotation bottleneck; the algorithmic core still centers on boundary refinement to improve localization accuracy; and the research scope has expanded from laboratory benchmarks to real-time online detection, multimodal fusion, and diverse industrial applications (e.g., power grids, healthcare, sports), with growing attention to computational efficiency and robustness in complex environments.
A total of 216 related papers.
Temporal action localization is an important yet challenging task in video understanding. Typically, such a task aims at inferring both the action category and the localization of the start and end frames for each action instance in a long, untrimmed video. While most current models achieve good results by using pre-defined anchors and numerous actionness scores, such methods are burdened with both a large number of outputs and heavy tuning of the locations and sizes corresponding to different anchors. In contrast, anchor-free methods are lighter, getting rid of redundant hyper-parameters, but have received little attention. In this paper, we propose the first purely anchor-free temporal localization method, which is both efficient and effective. Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module to gather more valuable boundary features for each proposal with a novel boundary pooling, and (iii) several consistency constraints to make sure our model can find the accurate boundary given arbitrary proposals. Extensive experiments show that our method beats all anchor-based and actionness-guided methods with a remarkable margin on THUMOS14, achieving state-of-the-art results, and comparable ones on ActivityNet v1.3. Code is available at https://github.com/TencentYoutuResearch/ActionDetection-AFSD.
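As a rough illustration of the anchor-free decoding described in this abstract, the sketch below turns per-snippet distance-to-boundary regressions and class scores directly into proposals, with no pre-defined anchors. The score threshold, tensor shapes, and random inputs are assumptions for demonstration and do not reproduce AFSD's actual head or refinement module.

```python
import torch

def decode_anchor_free(locations, start_offsets, end_offsets, cls_scores, score_thresh=0.3):
    """Decode per-snippet anchor-free predictions into action proposals.

    Each temporal location t regresses its distances to the action start and end,
    so a proposal is simply [t - d_start, t + d_end] with the location's class
    score; no anchor locations or sizes need to be tuned.
    """
    scores, labels = cls_scores.max(dim=-1)
    keep = scores > score_thresh
    starts = locations[keep] - start_offsets[keep]
    ends = locations[keep] + end_offsets[keep]
    return torch.stack([starts, ends], dim=-1), scores[keep], labels[keep]

T = 100
locations = torch.arange(T, dtype=torch.float32)   # snippet timestamps
start_off = torch.rand(T) * 5                       # predicted distance to action start
end_off = torch.rand(T) * 5                         # predicted distance to action end
cls_scores = torch.rand(T, 20)                      # per-snippet class scores
segs, scores, labels = decode_anchor_free(locations, start_off, end_off, cls_scores)
print(segs.shape, scores.shape)
```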
Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize a weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels would interfere with the learning of the fully-supervised detection head, leading to significant performance degradation. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy-label learning strategy to harness every potentially useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to add different weights to the noisy labels to train more effectively. Our model greatly outperforms the previous state-of-the-art method in both detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.
Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of temporal annotations, leading to a performance and framework gap compared with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training methods with noisy labels. From these perspectives, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate pseudo labels with better quality. Subsequently, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, the uncertainty mask and iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art WTAL results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3. Besides, extensive ablation studies demonstrate the contribution of each component of our method.
Weakly supervised temporal action localization (WTAL) aims to localize action instances with only video-level labels for supervision. Recent methods convert category labels to natural language through prompting and utilize pre-trained vision-language models to generate text representation from natural language for supervision. This is because natural language can provide more prosperous and generalized semantic supervision to compensate for the lack of supervision in weakly supervised scenarios. However, it should be noted that current prompting methods face limitations in generating dynamic prompts that adapt to each video, which leads to difficulties in accurately aligning text and video representations. In this work, we propose a novel Text-Video Knowledge Guided Prompting (TVKP) framework for WTAL, which generates video-aware prompts based on text-video knowledge to enhance semantic alignment between text and video representations and introduce more video-related external category labels to enrich semantic supervision. We introduce the video-aware prompting (VAP) module to learn text-video knowledge from the joint distribution of text and video representations to generate video-aware text representation. Meanwhile, to make VAP more effectively learn text-video knowledge, a text-video contrastive loss is proposed to ensure semantic consistency between text and video representations. In addition, we propose the external knowledge prompting (EKP) module to introduce more video-related text labels from an external knowledge base to enrich prompts for accurate semantic alignment. Extensive experiments are conducted on three public datasets, THUMOS14, ActivityNet1.2, and ActivityNet1.3, demonstrating that our approach outperforms state-of-the-art methods.
Point-supervised Temporal Action Localization (PS-TAL) detects temporal intervals of actions in untrimmed videos with a label-efficient paradigm. However, most existing methods fail to learn action completeness without instance-level annotations, resulting in fragmentary region predictions. In fact, the semantic information of snippets is crucial for detecting complete actions, meaning that snippets with similar representations should be considered as the same action category. To address this issue, we propose a novel representation refinement framework with a semantic query mechanism to enhance the discriminability of snippet-level features. Concretely, we set a group of learnable queries, each representing a specific action category, and dynamically update them based on the video context. With the assistance of these queries, we expect to search for the optimal action sequence that agrees with their semantics. Besides, we leverage some reliable proposals as pseudo labels and design a refinement and completeness module to refine temporal boundaries further, so that the completeness of action instances is captured. Finally, we demonstrate the superiority of the proposed method over existing state-of-the-art approaches on THUMOS14 and ActivityNet13 benchmarks. Notably, thanks to completeness learning, our algorithm achieves significant improvements under more stringent evaluation metrics.
Weakly-Supervised Temporal Action Localization (WTAL) aims to identify the temporal boundaries and classify actions in untrimmed videos using only video-level labels during training. Despite recent progress, many existing approaches primarily follow a localization-by-classification pipeline, treating snippets as independent instances and thus exploiting only limited contextual information. Besides, these methods struggle to capture multi-scale temporal information and neglect both the internal temporal structures within videos and the semantic consistency between videos, resulting in misclassification and inaccurate localization. To address these limitations, we introduce a novel Temporal and Semantic Correlation Network (TSC-Net) for WTAL task, which can be trained end-to-end. First, we propose a Multi-Scale Features Integration Pyramid (MFIP) module to integrate multi-scale temporal features, effectively addressing the challenge of missed detections caused by short action durations. Furthermore, we design a Temporal Correlation Enhancement (TCE) branch to enhance segment correlations by video-level temporal structures to improve the completeness of action localization. Finally, a Dataset-Wide Semantic Awareness (DSA) branch is designed to construct and propagate a dataset-level action semantics bank, enhancing the model’s awareness of semantic consistency in actions. Extensive experiments show that TSC-Net outperforms most existing WTAL methods, achieving an average mAP of 46.3% on the THUMOS-14 dataset and 26.5% on the ActivityNet1.2 dataset. Detailed ablation studies further confirm the effectiveness of each component in our model. The code and models are publicly available at https://github.com/linkang-els/TSC-Net-main.
Weakly supervised temporal action localization (WTAL) aims to precisely locate action instances in given videos by video-level classification supervision, which is partly related to action classification. Most existing localization works directly utilize feature encoders pre-trained for video classification tasks to extract video features, resulting in non-targeted features that lead to incomplete or over-complete action localization. Therefore, we propose the Generalized Contrast Learning Network (GCLNet), in which two novel strategies are proposed to improve the pre-trained features. First, to address the issue of over-completeness, GCLNet introduces text information with good context independence and category separability to enrich the expression of video features, as well as proposes a novel generalized contrastive learning approach for similarity metrics, which facilitates pulling closer the features belonging to the same category while pushing farther apart those from different categories. Consequently, it enables more compact intra-class feature learning and ensures accurate action localization. Second, to tackle the problem of incompleteness, we exploit the respective advantages of RGB and Flow features in scene appearance and temporal motion expression, designing a hybrid attention strategy in GCLNet to mutually enhance the features of each channel. This process greatly improves the features by establishing cross-channel consensus. Finally, we conduct extensive experiments on THUMOS14 and ActivityNet1.2, respectively, and the results show that our proposed GCLNet can produce more representative action localization features.
Point-supervised Temporal Action Localization poses significant challenges due to the difficulty of identifying complete actions with a single-point annotation per action. Existing methods typically employ Multiple Instance Learning, which struggles to capture global temporal context and requires heuristic post-processing. In research on fully-supervised tasks, DETR-based structures have effectively addressed these limitations. However, it is nontrivial to merely adapt DETR to this task, encountering two major bottlenecks: (1) how to integrate point label information into the model, and (2) how to select optimal decoder proposals for training in the absence of complete action segment annotations. To address these issues, we introduce an end-to-end framework by integrating Query Reformation and Optimal Transport (QROT). Specifically, we encode point labels through a set of semantic consensus queries, enabling effective focus on action-relevant snippets. Furthermore, we integrate an optimal transport mechanism to generate high-quality pseudo labels. These pseudo-labels facilitate precise proposal selection based on the Hungarian algorithm, significantly enhancing localization accuracy in point-supervised settings. Extensive experiments on the THUMOS14 and ActivityNet-v1.3 datasets demonstrate that our method outperforms existing MIL-based approaches, offering more stable and accurate temporal action localization under point-level supervision.
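To make the DETR-style matching mentioned in this abstract more tangible, the sketch below performs one-to-one assignment of decoder proposals to pseudo-label segments with SciPy's Hungarian solver. The cost weighting and the toy segments are illustrative assumptions, not the paper's exact recipe (which builds the pseudo labels via optimal transport).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_pseudo_labels(pred_segs, pred_scores, pseudo_segs):
    """One-to-one assignment of decoder proposals to pseudo-label segments.

    The cost combines L1 distance between predicted and pseudo segment
    boundaries with (negative) classification confidence; the Hungarian
    algorithm then selects which proposals receive supervision.
    """
    l1 = np.abs(pred_segs[:, None, :] - pseudo_segs[None, :, :]).sum(-1)  # (Q, G)
    cost = l1 - 2.0 * pred_scores[:, None]                                 # confident + close = cheap
    q_idx, g_idx = linear_sum_assignment(cost)
    return list(zip(q_idx.tolist(), g_idx.tolist()))

preds = np.array([[0.10, 0.30], [0.55, 0.80], [0.40, 0.45]])   # normalized (start, end)
scores = np.array([0.9, 0.7, 0.2])
pseudo = np.array([[0.12, 0.28], [0.50, 0.85]])
print(match_queries_to_pseudo_labels(preds, scores, pseudo))
```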
Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, the training of TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method to create training labels with a series of human-defined labeling functions. However, its application in TAL faces difficulties of defining complex actions in the context of temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain the relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming framework.
Weakly supervised temporal action localization aims to learn to locate actions in videos from video-level or point-level labels, avoiding the need for costly frame-level annotations. Unlike previous work that relies solely on visual modality information, we propose incorporating audio information into the weakly supervised temporal action localization task. While audio-visual localization tasks combine audio and visual information for video localization, temporal action localization often deals with action categories that have weak audio cues. To address this, we propose EAV-Mamba, the first audio-visual perception modeling method based on Mamba. Leveraging Mamba’s powerful audio-visual perception capabilities, we developed modules such as Audio-Perceptive Flow Enhancement, Audio-Perceptive RGB Enhancement, and Audio Self-Perceptive Enhancement. Extensive experiments on two publicly available temporal action localization datasets demonstrate that EAV-Mamba achieves efficient audio-visual perception modeling and state-of-the-art performance in weakly supervised temporal action localization tasks.
Temporal Action Localization (TAL) aims to classify and localize all actions within untrimmed videos. Existing TAL methods often struggle with inaccurate boundary predictions due to the similarity of action content and the uncertainty of boundaries between adjacent frames. Many of these methods rely on fixed or global proposal learning strategies, which lack a more refined method to improve localization accuracy. In this paper, we propose BRTAL, a new Boundary Refinement framework for TAL based on an offset-driven diffusion model, specifically designed to enhance action boundary precision through a refined approach iteratively. Unlike traditional TAL methods emphasizing global target predictions, BRTAL adopts a local refinement perspective by leveraging an offset-driven strategy. Specifically, our framework employs diffusion to iteratively generate local offsets between predictions and ground truth, gradually reducing these offsets to achieve better alignment with the ground truth. This refined approach is particularly effective in addressing the challenges of ambiguous boundaries frequently encountered in TAL, enabling BRTAL to achieve more refined boundary localization than existing methods. Furthermore, we introduce a lightweight yet powerful Temporal Context Modeling (TCM) module to enhance temporal information modeling for accurate action localization. TCM features a Temporal Representation Perception (TRP) layer, which captures temporal evolution and long-term contextual dependencies through a squeeze-and-excitation design combined with large convolutional kernels, ensuring robust temporal representation learning. Extensive experiments on THUMOS14, ActivityNet-1.3, and EPIC-KITCHEN 100 datasets highlight the significant advantages of BRTAL. Notably, BRTAL achieves an average mAP of 69.6% on THUMOS14, establishing a new state-of-the-art benchmark and demonstrating its outstanding boundary refinement capability.
Weakly-supervised fine-grained temporal action localization seeks to identify fine-grained action instances in untrimmed videos using only video-level labels. The primary challenge in this task arises from the subtle distinctions among various fine-grained action categories, which complicate the accurate localization of specific action instances. In this paper, we note that the context information embedded within the videos plays a crucial role in overcoming this challenge. However, we also find that effectively integrating context information across different scales is non-trivial, as not all scales provide equally valuable information for distinguishing fine-grained actions. Based on these observations, we propose a weakly-supervised fine-grained temporal action localization approach termed the Context Sensitive Network, which aims to fully leverage context information. Specifically, we first introduce a multi-scale context extraction module designed to efficiently capture multi-scale temporal contexts. Subsequently, we develop a scale-sensitive context gating module that facilitates interaction among multi-scale contexts and adaptively selects informative contexts based on varying video content. Extensive experiments conducted on two benchmark datasets, FineGym and FineAction, demonstrate that our approach achieves state-of-the-art performance.
Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.
Temporal Action Localization (TAL) aims to accurately identify the start and end times of actions in untrimmed videos and classify them according to specific labels. However, the complexity and imbalance between target actions and background in video data make this task particularly challenging. Although relying on large amounts of finely annotated data has led to some progress in existing methods, the presence of noisy labels in large-scale annotations limits their application in open-world scenarios. To address this issue, we take the perspective of the data itself, modeling the different energy patterns exhibited by the action foreground and background in video data to enhance video content inference. Specifically, we propose the Energy-Driven Meta Purifier (EDMP) method, which utilizes a meta-learning training paradigm to avoid dependence on extensive and precise manual annotations. Under this pipeline, we use energy modeling to distinguish between different actions and backgrounds from the perspective of energy differences, thereby improving the model's robustness to category noise. Additionally, these energy-based distinctions are employed to further refine action boundaries, enhancing the model's robustness to boundary noise. Experiments on THUMOS14 and ActivityNet1.3 datasets show that EDMP effectively enhances the robustness of TAL models.
Temporal action localization aims to identify the boundaries of the action of interest in a video. Most existing methods take a two-stage approach: first, identify a set of action proposals; then, based on this set, determine the accurate temporal locations of the action of interest. However, the diversely distributed semantics of a video over time have not been well considered, which could compromise the localization performance, especially for ubiquitous short actions or events (e.g., a fall in healthcare and a traffic violation in surveillance). To address this problem, we propose a novel deep learning architecture, namely an adaptive template-guided self-attention network, to characterize the proposals adaptively with their relevant frames. An input video is segmented into temporal frames, within which the spatio-temporal patterns are formulated by a global–Local Transformer-based encoder. Each frame is associated with a number of proposals of different lengths as their starting frame. Learnable templates for proposals of different lengths are introduced, and each template guides the sampling for proposals with a specific length. It formulates the probabilities for a proposal to form the representation of certain spatio-temporal patterns from its relevant temporal frames. Therefore, the semantics of a proposal can be formulated in an adaptive manner, and a feature map of all proposals can be appropriately characterized. To estimate the IoU of these proposals with ground truth actions, a two-level scheme is introduced. A shortcut connection is also utilized to refine the predictions by using the convolutions of the feature map from coarse to fine. Comprehensive experiments on two benchmark datasets demonstrate the state-of-the-art performance of our proposed method: 32.6% mAP@IoU 0.7 on THUMOS-14 and 9.35% mAP@IoU 0.95 on ActivityNet-1.3.
Weakly-supervised temporal action localization (WS-TAL) aims to localize actions in untrimmed videos with only video-level labels. Most existing models follow the "localization by classification" procedure: locate temporal regions contributing most to the video-level classification. Generally, they process each snippet (or frame) individually and thus overlook the fruitful temporal context relation. Here arises the single snippet cheating issue: "hard" snippets are too vague to be classified. In this paper, we argue that learning by comparing helps identify these hard snippets and we propose to utilize snippet Contrastive learning to Localize Actions, CoLA for short. Specifically, we propose a Snippet Contrast (SniCo) Loss to refine the hard snippet representation in feature space, which guides the network to perceive precise temporal boundaries and avoid the temporal interval interruption. Besides, since it is infeasible to access frame-level annotations, we introduce a Hard Snippet Mining algorithm to locate the potential hard snippets. Substantial analyses verify that this mining strategy efficaciously captures the hard snippets and SniCo Loss leads to more informative feature representation. Extensive experiments show that CoLA achieves state-of-the-art results on THUMOS’14 and ActivityNet v1.2 datasets.
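The following is a minimal InfoNCE-style sketch in the spirit of the snippet contrast described in this abstract: hard (ambiguous) snippets are pulled toward confident action snippets and pushed away from confident background snippets. The mining of easy/hard snippets from actionness scores is assumed to happen beforehand, and the mean-over-positives form and temperature here are simplifying assumptions rather than CoLA's exact loss.

```python
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(easy_act, hard_act, easy_bkg, temperature=0.07):
    """Contrast hard action snippets against easy action/background snippets.

    easy_act: (Np, D) confident action snippet features (positives)
    hard_act: (Nh, D) ambiguous snippet features (queries to refine)
    easy_bkg: (Nn, D) confident background snippet features (negatives)
    """
    q = F.normalize(hard_act, dim=-1)
    k_pos = F.normalize(easy_act, dim=-1)
    k_neg = F.normalize(easy_bkg, dim=-1)
    pos = (q @ k_pos.T).mean(dim=1, keepdim=True) / temperature   # (Nh, 1)
    neg = (q @ k_neg.T) / temperature                              # (Nh, Nn)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(q), dtype=torch.long)                 # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = snippet_contrastive_loss(torch.randn(8, 256), torch.randn(4, 256), torch.randn(16, 256))
print(loss.item())
```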
Weakly-supervised temporal action localization (WTAL) aims to localize and classify action instances in untrimmed videos with only video-level labels available. Despite the remarkable success of existing methods, whose generated proposals commonly far outnumber the ground-truth action instances, it still makes sense to improve the ranking accuracy of the generated proposals since users in real-world scenarios usually prioritize the action proposals with the highest confidence scores. The inaccuracy of the proposal ranking mainly comes from two aspects: For one thing, the traditional proposal generation manner entirely relies on snippet-level perception, resulting in a significant yet unnoticed gap with the target of proposal-level localization. For another, existing methods commonly employ a hand-crafted proposal generation manner, a post-process that does not participate in model optimization. To address the above issues, we propose an end-to-end trained two-stage method, termed as Learning Proposal-aware Re-ranking (LPR) for WTAL. In the first stage, we design a proposal-aware feature learning module to inject the proposal-aware contextual information into each snippet, and then the enhanced features are utilized for predicting initial proposals. Furthermore, to perform effective and efficient proposal re-ranking, in the second stage, we contrast the proposals attached with high confidence scores with our constructed multi-scale foreground/background prototypes for further optimization. Evaluated by both the vanilla and Top-$k$ mAP metrics, results of extensive experiments on two popular benchmarks demonstrate the effectiveness of our proposed method.
Most modern approaches in temporal action localization divide this problem into two parts: (i) short-term feature extraction and (ii) long-range temporal boundary localization. Due to the high GPU memory cost caused by processing long untrimmed videos, many methods sacrifice the representational power of the short-term feature extractor by either freezing the backbone or using a small spatial video resolution. This issue becomes even worse with the recent video transformer models, many of which have quadratic memory complexity. To address these issues, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization Transformer with Long-term memory. Our long-term memory mechanism eliminates the need for processing hundreds of redundant video frames during each training iteration, thus significantly reducing the GPU memory consumption and training time. These efficiency savings allow us (i) to use a powerful video transformer feature extractor without freezing the backbone or reducing the spatial video resolution, while (ii) also maintaining long-range temporal boundary localization capability. With only RGB frames as input and no external action recognition classifier, TALLFormer outperforms previous state-of-the-arts by a large margin, achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The code is publicly available: https://github.com/klauscc/TALLFormer.
The weakly-supervised temporal action localization task is to identify action categories and their start and end times in untrimmed videos. How to achieve feature calibration between different modalities in this task, and how to further optimize action boundaries based on the similarity of common action sequences, remains an urgent problem to be solved. Based on the above issues, we propose a novel network framework, weakly supervised temporal action localization via feature calibration-assisted sequence comparison (FCSC). The core of the FCSC framework lies in the Multi-Modal Feature Calibration Module (MFCM), which utilizes global and local contextual information from the primary and auxiliary modalities to enhance RGB and FLOW features, respectively, achieving deep feature calibration. In addition, the framework introduces an improved distinguishable edit distance metric for sequence similarity optimization (SSO) and maximum consistent subsequence (MCS) extraction to narrow the gap between classification and localization tasks. Extensive experiments show that FCSC achieves mAPs of 47.7% and 27.9% on the THUMOS14 and ActivityNet1.2 temporal action localization benchmark test sets, respectively, fully verifying the effectiveness of the model.
Temporal action localization is a fundamental task in video understanding that focuses on classifying and temporally localizing action instances in untrimmed videos. Compared to temporal action localization, the Weakly supervised Temporal Action Localization (WTAL) task presents greater challenges, as its training data lacks detailed information about action boundaries. Existing WTAL methods ignore the complementary relationship between modalities and the dependency between snippets, resulting in inaccurate localization results. To solve these issues, we propose a Collaborative Hierarchical Aggregation Network (CHA-Net). Specifically, we first use a modality complementary module to learn the synergies between modalities. Then, a collaborative enhance module is proposed to remove the information irrelevant to actions in RGB modality. Finally, a hierarchical aggregation module is proposed to capture the complete temporal information of action instances to better mine the temporal dependencies between snippets. Extensive experiments on THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of our method. Compared with F3-Net (TMM2024, Avg{0.1:0.5}) and SPCC-Net (TMM2024, Avg{0.1:0.7}) on the THUMOS14 dataset, the proposed method can achieve improvements of 3.2% and 2.4%, respectively.
Most popular Feature Pyramid Networks (FPN) for temporal action localization (TAL) in videos encode multi-scale features during downsampling, which inevitably brings fine-grained feature loss. In addition, most popular TAL models directly apply self-attention mechanisms that impose equal importance on consecutive frames, which might lead to feature homogenization. To address these problems, we propose a Dual-Spiral Pyramid Network with Multi-scale Attention (DSPA), which consists of three main modules: a Feature Enhancement Module (FEM), a Dual-Spiral Feature Pyramid Network (Ds-FPN), and a Multi-Scale Dual-Spiral Attention Convolution Module (Ds-MAC). To be specific, we use the FEM to enhance features by exploring the relations along different temporal and channel dimensions. Moreover, the Ds-FPN integrates high-resolution temporal features from the base layer with fine-grained features processed by the FEM and sequentially propagates these fused features across adjacent layers to construct a hierarchical multi-scale video representation. Furthermore, the Ds-MAC adopts a hierarchical architecture with long-term and short-term temporal modeling and residual learning to capture global context and fine details, while enhancing feature diversity and reducing convergence risk through advanced nonlinear transformations. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on two public datasets, THUMOS14 and EPIC-Kitchens 100.
Deep learning models need to encode both local and global temporal dependencies for accurate temporal action localization (TAL). Recent approaches have relied on Transformer blocks, which have quadratic complexity. By contrast, Mamba blocks have been adapted for TAL due to their comparable performance and lower complexity. However, various factors can influence the choice between these models, and a thorough analysis of them can provide valuable insights into the selection process. In this work, we analyze the Transformer block, Mamba block, and their combinations as temporal feature encoders for TAL, measuring their overall performance, efficiency, and sensitivity across different contexts. Our analysis suggests that Mamba blocks should be preferred due to their performance and efficiency. Hybrid encoders can serve as an alternative choice when sufficient computational resources are available.
Weakly-supervised temporal action localization (WTAL) aims to identify and localize action instances in untrimmed videos using only video-level labels. Existing methods typically rely on original features from frozen pre-trained encoders designed for trimmed action classification (TAC) tasks, which inevitably introduces task discrepancy. Additionally, these methods often overlook the importance of considering action consistency from multiple perspectives, specifically the consistency in action processes and action semantics, both of which are crucial for the model's understanding of actions. To address these issues, we propose a novel WTAL method based on similar modality enhancement and action consistency learning (SEAL). First, we construct global descriptors for each action category, and use the pseudo-labels generated based on these descriptors to guide the model in learning more consistent representations, thereby mitigating task discrepancy. Second, we design two types of losses to achieve action consistency learning: process consistency loss, which penalizes candidate proposals that deviate from the action center to ensure the completeness of the action process, and semantic consistency loss, which employs local descriptors to help proposals of the same action category (especially those with apparent semantic confusion) learn similar feature distributions. Extensive experiments on the THUMOS14 and ActivityNet datasets demonstrate the superior performance of the proposed method compared to state-of-the-art methods.
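The descriptor-and-pseudo-label idea above can be illustrated as follows: build one global descriptor per class from the most confident snippets and derive pseudo-labels from cosine similarity to those descriptors. Shapes, the top-k rule, and the threshold are illustrative assumptions, not SEAL's exact procedure.

```python
# Illustrative sketch of descriptor-guided pseudo-labels, assuming one global
# descriptor per action class obtained by averaging high-confidence snippet
# features; thresholds and shapes are made up for the example.
import torch
import torch.nn.functional as F

def build_descriptors(features, class_probs, top_k=5):
    # features: (T, D) snippet features; class_probs: (T, C) snippet scores.
    descriptors = []
    for c in range(class_probs.shape[1]):
        idx = class_probs[:, c].topk(top_k).indices   # most confident snippets
        descriptors.append(features[idx].mean(dim=0))
    return torch.stack(descriptors)                   # (C, D)

def descriptor_pseudo_labels(features, descriptors, threshold=0.7):
    # Cosine similarity between each snippet and each class descriptor.
    sim = F.cosine_similarity(features.unsqueeze(1), descriptors.unsqueeze(0), dim=-1)
    return (sim > threshold).float()                  # (T, C) binary pseudo-labels

T, C, D = 100, 20, 256
feats, probs = torch.randn(T, D), torch.rand(T, C)
desc = build_descriptors(feats, probs)
pseudo = descriptor_pseudo_labels(feats, desc)
print(desc.shape, pseudo.shape)                       # (20, 256) (100, 20)
```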
Weakly supervised temporal action localization (WTAL) targets the joint classification of action categories and precise delineation of their temporal boundaries in untrimmed videos while relying only on video-level labels. The absence of frame-level supervision inevitably causes two key difficulties: (i) incomplete localization of action segments and (ii) confusion between foreground and background frames. To overcome these challenges, we propose the Consensus-Guided Selective Multimodal Fusion Network (CG-SMFNet). First, a Selective Fusion Module (SFM) exploits the complementarity of multimodal cues to distill rich semantic representations. Second, a Consensus Attention Mechanism (CAM) dynamically assigns fusion weights to the three modality branches and enables bidirectional information exchange, ensuring a more holistic capture of action content. Finally, a Discrepant Expansion Mechanism (DEM) introduces a semantic contrast loss that enlarges the distance between foreground segments and semantically similar background regions, further sharpening localization accuracy. Extensive experiments on public benchmarks verify that CG-SMFNet achieves state-of-the-art performance under weak supervision.
Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.
No abstract available
The training of temporal action localization models relies heavily on a large amount of manually annotated data, and video annotation is more tedious and time-consuming than image annotation. Therefore, semi-supervised methods that combine labeled and unlabeled data for joint training have attracted increasing attention from academia and industry. This study proposes a method called pseudo-label refining (PLR) based on the teacher-student framework, which consists of three key components. First, we propose pseudo-label self-refinement, which features a temporal region-of-interest pooling operation to improve the boundary accuracy of TAL pseudo-labels. Second, we design a boundary synthesis module to further refine the temporal intervals of pseudo-labels through multiple inferences. Finally, an adaptive weight learning strategy is tailored for progressively learning pseudo-labels of different qualities. The proposed method uses ActionFormer and BMN as detectors and achieves significant improvements on the THUMOS14 and ActivityNet v1.3 datasets. The experimental results show that the proposed method significantly improves localization accuracy compared to other advanced SSTAL methods at label rates of 10% to 60%. Further ablation experiments show the effectiveness of each module, proving that the PLR method can improve the accuracy of pseudo-labels obtained by teacher-model inference.
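As a rough illustration of the pooling-based refinement step, the sketch below max-pools snippet actionness inside a candidate interval over a fixed number of temporal bins and keeps the slightly shifted boundaries with the highest pooled response; the bin count, shift range, and scoring rule are invented for the example and are not the paper's design.

```python
# Sketch of temporal region-of-interest pooling over snippet-level actionness
# scores, used here to score and nudge a candidate pseudo-label interval.
import numpy as np

def temporal_roi_pool(scores, start, end, num_bins=8):
    # scores: (T,) per-snippet actionness; [start, end) is a candidate interval.
    edges = np.linspace(start, end, num_bins + 1).astype(int)
    pooled = [scores[edges[b]:max(edges[b] + 1, edges[b + 1])].max()
              for b in range(num_bins)]
    return np.array(pooled)                  # fixed-length summary of the interval

def refine_interval(scores, start, end, shift=2):
    # Keep the boundary pair whose pooled response is highest among small shifts:
    # a crude stand-in for the refinement step.
    best = max(((s, e) for s in range(max(0, start - shift), start + shift + 1)
                        for e in range(end - shift, min(len(scores), end + shift) + 1)
                        if e - s > 1),
               key=lambda se: temporal_roi_pool(scores, *se).mean())
    return best

actionness = np.clip(np.concatenate([np.zeros(20), np.ones(30), np.zeros(50)])
                     + 0.1 * np.random.randn(100), 0, 1)
print(refine_interval(actionness, start=18, end=46))
```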
In recent years, the wide application of weakly supervised temporal action localization (WTAL) technology has improved the efficiency of video analysis. However, this domain continues to confront numerous challenges, especially due to the lack of precise temporal annotations. Consequently, this technique becomes highly susceptible to contextual background noise and overly reliant on prominent action segments, leading to less-than-ideal action localization. To alleviate this problem, we propose the contrastive learning-based action salience network (CLASNet), comprising two pivotal modules: a feature contrast separation module (FCSM) and a boundary refinement module (BRM). FCSM utilizes a contrastive learning approach to effectively separate action features from background features, thereby enhancing the discriminability of features. Concurrently, BRM introduces a boundary refinement loss to rectify the temporal boundaries of actions, further elevating the precision of temporal localization. The collaborative functioning of these two key modules effectively resolves the ambiguity issues in temporal action localization under weak supervision, markedly enhancing localization accuracy. Furthermore, CLASNet is versatile and can be integrated into different WTAL frameworks, achieving enhanced localization performance while preserving the original end-to-end training manner. Using three large-scale benchmark action localization datasets, THUMOS14, ActivityNet v1.2, and ActivityNet v1.3, we embed CLASNet into various cutting-edge weakly supervised temporal action localization methods, such as CO2-Net, DELU, and ACRNet, for empirical substantiation. The experimental outcomes reveal that CLASNet significantly enhances the efficacy of these methods in action localization, offering novel perspectives for the advancement of temporal action localization technology.
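One simple way to realize the action/background contrast described above is an InfoNCE-style loss that pulls the top-scoring (pseudo-action) snippets toward an action prototype and pushes the bottom-scoring (pseudo-background) snippets away; the selection rule and temperature below are placeholders rather than CLASNet's actual formulation.

```python
# Sketch of a contrastive action/background separation loss, assuming the top-
# and bottom-scoring snippets of a video serve as action and background sets.
import torch
import torch.nn.functional as F

def action_background_contrast(features, actionness, k=5, temperature=0.1):
    # features: (T, D) snippet features; actionness: (T,) scores from the base branch.
    feats = F.normalize(features, dim=-1)
    act_idx = actionness.topk(k).indices                  # pseudo action snippets
    bkg_idx = (-actionness).topk(k).indices               # pseudo background snippets
    anchor = F.normalize(feats[act_idx].mean(dim=0, keepdim=True), dim=-1)  # (1, D)
    pos = (feats[act_idx] @ anchor.t()).squeeze(1) / temperature            # (k,)
    neg = (feats[bkg_idx] @ anchor.t()).squeeze(1) / temperature            # (k,)
    # InfoNCE-style: each action snippet competes against all background snippets.
    logits = torch.cat([pos.unsqueeze(1), neg.unsqueeze(0).expand(k, -1)], dim=1)
    return (torch.logsumexp(logits, dim=1) - pos).mean()

loss = action_background_contrast(torch.randn(100, 256), torch.rand(100))
print(float(loss))
```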
Weakly supervised temporal action localization (WTAL) aims to identify action instances in untrimmed videos with only video-level supervision. Despite recent advances in WTAL methods, achieving accurate boundary localization remains a significant challenge. A key reason is that WTAL networks following a localization-by-classification pipeline tend to focus on the most discriminative features, neglecting some ambiguous features that may contain action instances. To make the WTAL model focus on low-discriminative features that include action instances, we propose an action-to-action diffusion (ActionDiff) network. This network leverages the smoothness of data generated by the diffusion model, using the diffusion model to output smooth and high-quality features that weaken the discriminative action features from the base branch, thereby enhancing the performance of the WTAL task. First, we develop a topk-based masking strategy to generate binary masks that serve as pseudo-labels for diffusion model learning. Then, we propose a diffusion branch to generate high-quality latent action space by iteratively removing noise guided by the designed pseudo-labels and conditional information. To enhance the diffusion branch’s capability to generate human behavioral features, we design an action-related conditional strategy to obtain conditional information and use it to guide the modeling of human behavior knowledge by the diffusion branch. Our comprehensive experiments demonstrate that the proposed method achieves a promising performance on three benchmark datasets: THUMOS14, ActivityNet v1.2, and v1.3.
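The top-k masking step can be sketched in a few lines: given the class activation sequence and the video-level label, mark the k highest-scoring snippets as the binary pseudo-label mask. The ratio used for k and the single-label assumption are illustrative.

```python
# Minimal sketch of a top-k masking step that turns a class activation sequence
# into a binary pseudo-label mask; k and the single video-level class are
# illustrative assumptions, not the paper's exact configuration.
import torch

def topk_binary_mask(cas, video_label, ratio=0.125):
    # cas: (T, C) class activation sequence; video_label: index of the
    # video-level ground-truth class.
    scores = cas[:, video_label]                     # (T,)
    k = max(1, int(ratio * cas.shape[0]))
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0               # 1 = likely action snippet
    return mask

cas = torch.rand(160, 20)
mask = topk_binary_mask(cas, video_label=3)
print(mask.sum().item(), mask.shape)                 # 20.0 torch.Size([160])
```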
Temporal Action Localization (TAL) aims to localize the start and end timestamps of actions with specific categories in untrimmed videos. Despite great success, noisy action boundary labels may be included due to the inherent subjectivity of manual annotations. This can lead TAL models to learn inaccurate action boundaries during training, potentially impairing their localization performance. To systematically analyze and enhance the TAL models’ robustness against noisy action boundary labels, we introduce a new task termed TAL with Noisy Label. We demonstrate that introducing even minimal random noise to action boundary labels in training data can substantially degrade the performance of leading TAL methods, thereby underscoring their vulnerability to noisy action boundary labels. To be specific, we propose a novel plug-and-play method called Energy-based Meta Boundary Refinement (EMBR), where a meta-learning pipeline is employed to rectify noisy action boundary labels, ameliorating the misguidance of noisy labels on model training. Under this meta-learning pipeline, EMBR utilizes an energy function to calculate the magnitude of label noise and re-weights samples, assigning lower weights to samples with higher noise, alleviating the impact of noisy samples on model training. In addition, considering the energy difference between action and background segments, an energy-based loss function is proposed to achieve larger energy differences across the boundary, assisting in the boundary refinement. Experimental results on the THUMOS14, ActivityNet1.3, and HACS datasets demonstrate the effectiveness of EMBR in enhancing the robustness of TAL models.
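A minimal sketch of the re-weighting idea follows, assuming a per-sample "energy" (estimated noise magnitude) has already been computed from the model's outputs: samples with higher energy receive lower weight in the boundary regression loss.

```python
# Sketch of energy-based sample re-weighting: samples whose estimated label
# noise (the "energy") is high get a smaller weight in the boundary loss.
# How the energy is computed from model outputs is abstracted away here.
import torch

def energy_weights(energies, temperature=1.0):
    # energies: (N,) estimated noise magnitude per training sample.
    # Lower energy -> larger weight; weights sum to N so the loss scale is kept.
    w = torch.softmax(-energies / temperature, dim=0)
    return w * len(energies)

def weighted_boundary_loss(pred, target, energies):
    per_sample = (pred - target).abs().mean(dim=-1)      # (N,) L1 on (start, end)
    return (energy_weights(energies) * per_sample).mean()

pred = torch.tensor([[10.0, 42.0], [5.0, 20.0], [63.0, 90.0]])
noisy_target = torch.tensor([[11.0, 40.0], [5.0, 35.0], [62.0, 91.0]])
energies = torch.tensor([0.2, 2.5, 0.1])                 # second sample looks noisy
print(float(weighted_boundary_loss(pred, noisy_target, energies)))
```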
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
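The interval-classification formulation described above can be sketched as follows: cut the video into fixed-length non-overlapping windows, assign each window an action class or a background class, and merge consecutive identical labels into segments. The window size, class names, and the stub classifier are assumptions for illustration only.

```python
# Sketch of TAL via fixed-length non-overlapping interval classification with an
# added background class; the per-window classifier is a stand-in stub.
import random

BACKGROUND = "background"

def classify_window(frames):
    # Placeholder for the per-window classifier (e.g., a TSM-based model).
    return random.choice(["serve", "rally", BACKGROUND])

def localize(num_frames, window=16, fps=30.0):
    labels = [classify_window(range(s, s + window))
              for s in range(0, num_frames - window + 1, window)]
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != BACKGROUND:              # drop background runs
                segments.append((start * window / fps, i * window / fps, labels[start]))
            start = i
    return segments

random.seed(0)
for seg in localize(num_frames=480):
    print(f"{seg[2]}: {seg[0]:.2f}s - {seg[1]:.2f}s")
```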
Temporal action localization (TAL) is a research hotspot in video understanding, which aims to locate and classify actions in videos. However, existing methods have difficulties in capturing long-term actions because they focus on local temporal information, which leads to poor performance in localizing long temporal sequences. In addition, most methods ignore the importance of boundaries for action instances, resulting in inaccurately localized boundaries. To address these issues, this paper proposes a state space model for temporal action localization, called Separated Bidirectional Mamba (SBM), which innovatively understands frame changes from the perspective of state transformation. It adapts to different sequence lengths and incorporates forward and backward state information for each frame through a forward Mamba and a backward Mamba, obtaining more comprehensive action representations and enhancing modeling capabilities for long temporal sequences. Moreover, this paper designs a Boundary Correction Strategy (BCS). It calculates the contribution of each frame to action instances based on the pre-localized results, then adjusts the weights of frames in boundary regression so that the boundaries are shifted towards the frames with higher contributions, leading to more accurate boundaries. To demonstrate the effectiveness of the proposed method, this paper reports mean Average Precision (mAP) under temporal Intersection over Union (tIoU) thresholds on four challenging benchmarks: THUMOS13, ActivityNet-1.3, HACS, and FineAction, where the proposed method achieves mAPs of 73.7%, 42.0%, 45.2%, and 29.1%, respectively, surpassing state-of-the-art approaches.
The purpose of weakly-supervised temporal action localization (WTAL) task is to simultaneously classify and localize action instances in untrimmed videos with only video-level labels. Previous works fail to extract multi-scale temporal features to identify action instances with different durations, and they do not fully use the temporal cues of action video to learn discriminative features. In addition, the classifiers trained by current methods usually focus on easy-to-distinguish snippets while ignoring other semantically ambiguous features, which leads to incomplete and over-complete localization. To address these issues, we introduce a new Snippet-inter Difference Attention Network (SDANet) for WTAL, which can be trained end-to-end. Specifically, our model presents three modules, with primary contributions lying in the snippet-inter difference attention (SDA) module and potential feature mining (PFM) module. Firstly, we construct a simple multi-scale temporal feature fusion (MTFF) module to generate multi-scale temporal feature representation, so as to help the model better detect short action instances. Secondly, we consider the temporal cues of video features and design SDA module based on the Transformer to capture global discriminative features for each modality based on multi-scale features. It calculates the differences between temporal neighbor snippets in each modality to explore salient-difference features, and then utilizes them to guide correlation modeling. Thirdly, after learning discriminative features, we devise PFM module to excavate potential action and background snippets from ambiguous features. By contrastive learning, potential actions are forced closer to discriminative actions and away from the background, thereby learning more accurate action boundaries. Finally, two losses (i.e., similarity loss and reconstruction loss) are further developed to constrain the consistency between two modalities and help the model retain original feature information for better localization results. Extensive experiments show that our model achieves better performance against current WTAL methods on three datasets, i.e., THUMOS14, ActivityNet1.2 and ActivityNet1.3.
Weakly supervised temporal action localization uses video-level labels to locate action segments in untrimmed long videos. It is widely applicable across scenarios, but faces the challenges of feature redundancy and boundary blur. To address the problems of redundant feature modeling and ambiguous boundary localization, this paper adopts a two-stage optimization method and constructs a proposal generation and classification network that integrates a global-local context awareness mechanism with a dynamic boundary optimization strategy.
Temporal action localization is a classic computer vision problem in video understanding with a wide range of applications. In the context of sports videos, it is integrated into most of the current solutions used by coaches, broadcasters and game specialists to assist in performance analysis, strategy development, and enhancing the viewing experience. This work presents an application study on temporal action localization for tennis broadcast videos. We study and evaluate a foundational video understanding model for identifying tennis actions in match footage. We explore its architecture, specifically the state space model, from video input to the prediction of temporal segments and classification labels. Our experiments provide findings and interpretations of the model's performance on tennis data. We achieved an average mean Average Precision (mAP) of 66.14% over all thresholds on the TenniSet dataset, surpassing the other methods, and 96.16% on our private French Open dataset.
In the field of comprehensive video understanding, Temporal Action Localization (TAL) plays a vital role by precisely identifying when actions begin and end in untrimmed video sequences, enabling more accurate analysis of complex temporal dynamics. Despite their importance, existing datasets and algorithms in the sports field face significant challenges. Current datasets independently consider single-person or two-person event categories, overlooking the simultaneous occurrence of multiple actions and interactions in real-world environments. Existing studies primarily concentrate on mainstream sports like soccer, traditional basketball, and volleyball, whereas numerous sporting disciplines such as 3×3 basketball remain underserved with respect to specialized datasets and custom analytical frameworks. To address this issue, we propose a new real-world 3×3 basketball TAL (TAL3×3) dataset and algorithm: the TAL3×3 dataset includes 3 single-person action classes (such as shooting a basketball) and 5 two-person interaction classes (such as passing the ball between players) with 633k human bounding boxes and 99,572 action instances on 106k frames. To benchmark TAL3×3, we develop the TAL3×3 algorithm consisting of two distinct phases: 1) generation of action proposals, and 2) construction of representations for each proposal, followed by classification into specific interaction categories or background. Extensive experiments demonstrate that our method achieves remarkable performance with 60.09% mAP and 78.93% accuracy on our dataset, substantially outperforming existing approaches and establishing new benchmarks for complexity-aware action localization in team sports. We expect TAL3×3 will contribute to temporal action localization and basketball game analytics, while advancing the development of temporal contextual modeling techniques in the field of TAL. The dataset is available at https://github.com/open-starlab/TAL3×3
Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we have conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, outperforming state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS14, ActivityNet-1.3, and HACS datasets, respectively.
Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address these issues, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies action presence at the video granularity. Finally, it assigns each action proposal to one category from the coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier's awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
No abstract available
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i.e., video snippets) are supervised by classifying labeled bags (i.e., untrimmed videos). However, this formulation typically treats snippets in a video as independent instances, ignoring the underlying temporal structures within and across action segments. To address this problem, we propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction. Furthermore, a multi-step refinement strategy is proposed to progressively improve action proposals along the model training process. Extensive experiments on THUMOS-14 and ActivityNet-v1.3 demonstrate the effectiveness of our approach, establishing new state of the art on both datasets. The code and models are publicly available at https://github.com/boheumd/ASM-Loc.
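For readers unfamiliar with the MIL baseline that this framework extends, the sketch below shows the standard pipeline: a snippet-level classifier produces a class activation sequence, top-k mean pooling aggregates it into video-level logits, and a binary cross-entropy loss against the video-level labels supervises training. Dimensions and the k ratio are illustrative.

```python
# Minimal sketch of MIL-style video-level supervision: snippet (instance) scores
# are aggregated by top-k mean pooling into a bag (video) prediction and trained
# against the video-level label. Not ASM-Loc's configuration, just the baseline.
import torch
import torch.nn as nn

class MILHead(nn.Module):
    def __init__(self, feat_dim=256, num_classes=20, k_ratio=0.125):
        super().__init__()
        self.classifier = nn.Conv1d(feat_dim, num_classes, kernel_size=1)
        self.k_ratio = k_ratio

    def forward(self, snippet_feats):
        # snippet_feats: (B, T, D) -> class activation sequence (B, T, C)
        cas = self.classifier(snippet_feats.transpose(1, 2)).transpose(1, 2)
        k = max(1, int(self.k_ratio * cas.shape[1]))
        video_logits = cas.topk(k, dim=1).values.mean(dim=1)   # (B, C)
        return cas, video_logits

head = MILHead()
feats = torch.randn(2, 160, 256)
video_labels = torch.zeros(2, 20); video_labels[0, 3] = 1; video_labels[1, 7] = 1
cas, logits = head(feats)
loss = nn.functional.binary_cross_entropy_with_logits(logits, video_labels)
print(cas.shape, logits.shape, float(loss))
```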
In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank loss problem of self-attention that takes place in the video features and aggregate information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks: THUMOS14, HACS and EPIC-KITCHEN 100, with lower computational costs, compared to previous methods. For example, TriDet hits an average mAP of 69.3% on THUMOS14, outperforming the previous best by 2.5%, but with only 74.6% of its latency. The code is released to https://github.com/dingfengshi/TriDet.
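The distribution-based boundary modeling can be illustrated with a small head that predicts a probability distribution over a discrete set of relative offsets at every temporal location and takes the expectation as the boundary estimate; the bin count and layout below are example choices, not TriDet's exact Trident-head.

```python
# Sketch of a boundary head that predicts a distribution over relative offsets
# and uses its expectation as the boundary, in the spirit of the
# distribution-based boundary modeling described above.
import torch
import torch.nn as nn

class DistributionBoundaryHead(nn.Module):
    def __init__(self, feat_dim=256, num_bins=16):
        super().__init__()
        self.start_logits = nn.Linear(feat_dim, num_bins)
        self.end_logits = nn.Linear(feat_dim, num_bins)
        # Candidate offsets (in snippets) to the left/right of each location.
        self.register_buffer("bins", torch.arange(num_bins).float())

    def forward(self, feats):
        # feats: (B, T, D). Expected offset = sum_b p(b) * offset_b.
        start_off = (self.start_logits(feats).softmax(-1) * self.bins).sum(-1)
        end_off = (self.end_logits(feats).softmax(-1) * self.bins).sum(-1)
        t = torch.arange(feats.shape[1], device=feats.device).float()
        return t - start_off, t + end_off                 # (B, T) start and end

head = DistributionBoundaryHead()
starts, ends = head(torch.randn(2, 100, 256))
print(starts.shape, ends.shape)     # torch.Size([2, 100]) torch.Size([2, 100])
```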
Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field has achieved remarkable progress in recent years, further progress and real-world applications are impeded by the absence of a standardized framework. Currently, different methods are compared under different implementation settings, evaluation protocols, etc., making it difficult to assess the real effectiveness of a specific technique. To address this issue, we propose OpenTAD, a unified TAD framework consolidating 16 different TAD methods and 9 standard datasets into a modular codebase. In OpenTAD, minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two. OpenTAD also facilitates straightforward benchmarking across various datasets and enables fair and in-depth comparisons among different methods. With OpenTAD, we comprehensively study how innovations in different network components affect detection performance and identify the most effective design choices through extensive experiments. This study has led to a new state-of-the-art TAD method built upon existing techniques for each component. Our code and models are available at https://github.com/sming256/OpenTAD.
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
Video embedding is the pivot in Temporal Action Detection (TAD). Once the video embedding can robustly capture the essence of actions and perceive activities in complex scenes, the TAD model can more accurately localize action boundaries. Currently, video embedding is typically based on rule-based pixel convolution or cube-based transformer, wherein structured semantic information is intertwined, leading to the submergence of crucial spatial semantic information, such as the intrinsic motion of key semantic objects and interactions among semantic objects. To address these limitations, it is imperative to explore alternative approaches. With the remarkable performance of general semantic segmentation models in visual representation, we introduce the general segmentation model SEEM into the video embedding paradigm, constructing a semantically structured representation from perceptual semantics to cognitive semantics. To more effectively utilize SEEM for structured video representation, we designed the Semantic Adapter (Sem-Adapter) as a bridge to connect the two models. Firstly, we design a Self-Motion Module (SMM) to pay attention to the self-motion of key semantic regions. Secondly, we propose a Mutual Relation Module (MRM) to construct the interactions between semantic regions. Extensive experiments on ActivityNet-1.3, THUMOS-14 and EPIC-Kitchens-100 reveal that our method significantly outperforms state-of-the-art methods under the same input modality, and our method improves the average mAP from 60.6% to 64.2% on THUMOS-14 with the same backbone. The code is available on https://github.com/shouxiaozixuan/semtad.
No abstract available
By detecting abnormal violation events in surveillance videos, safety management capabilities in high-risk power operations can be improved. This research constructs an intelligent abnormal event detection technology using deep learning algorithms, aiming to improve the detection accuracy of anomalous events. The research improves the parameter-setting method and fully connected layer of three-dimensional convolutional networks to enhance their ability to recognize three-dimensional features. An improved algorithm is adopted as the basic structure of the temporal action detection technology, and frame interpolation is applied to improve the accuracy of temporal action detection. A surveillance-video anomaly detection model based on the improved temporal action detection technology is established. The experimental outcomes show that the improved three-dimensional convolutional network achieves convergence after 32 iterations, with an accuracy of 99.15% and a recall rate of 98.3%. The average accuracy on the three tested datasets is better than that of other algorithms. The average precision of the research model for detecting throwing objects from high altitude, crossing fences, smoking, and checking electricity without gloves is 89.1%, 88.9%, 96.6%, and 96.2%, respectively. The accuracy of abnormal event detection across different time periods is superior to other models, and the average recall of the research model is 94.3%, which is higher than that of other models. The results indicate that the research model can accurately recognize abnormal events in massive, diverse, and complex surveillance videos. The proposed abnormal event detection model can be applied to the intelligent management platform of the power industry, thereby improving safety management capability in power operations.
Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.
In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of originally designed architectures for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder that consists of multi-scale deformable attention and feedforward network with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment. Code is available at: https://github.com/Dotori-HJ/DiGIT
Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders the parallelization efficiency of GPUs due to inefficient multiplication between small matrices. Instead of pruning channels, we propose a Progressive Block Drop method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop blocks with minimal impact on model performance; second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore accuracy. Our method achieves a 25% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3) while achieving lossless compression. More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with them to yield further efficiency gains.
Multigranularity Feature Aggregation and Cross-level Boundary Modeling for Temporal Action Detection
This article presents a Temporal Action Detection (TAD) method with Multigranularity (MG) feature aggregation and Cross-level Boundary Modeling (CBM). Compared with other methods, our proposed approach has the following advantages. First, different from most existing works which only consider the local temporal context, a simple and computationally efficient MG module is proposed to comprehensively extract video features at instant, local, and global temporal granularities. Second, unlike the methods that only employ information from a single feature pyramid level for action boundary regression, a CBM strategy that integrates the relative information from both the same and higher level features is designed to improve the accuracy of boundary prediction. Finally, benefiting from the MG module and CBM strategy, our method outperforms other state-of-the-art approaches on five challenging TAD datasets: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. We make our code and pre-trained model publicly available at: https://github.com/MGCBM/TAL-MGCBM
Open-vocabulary Temporal Action Detection (Open-vocab TAD) extends the detection scope of Closed-vocabulary Temporal Action Detection (Closed-vocab TAD) to unseen action classes specified by vocabularies not included in the training data, within untrimmed video. Typical Open-vocab TAD methods adopt a two-stage approach that first proposes candidate action intervals and then identifies those actions. However, errors in the first stage can affect the subsequent stage and the final detection results. Moreover, conventional methods for temporal context analyses tend to focus solely on either global or local context. Focusing solely on the global context can lead to lack of momentary detail, making it difficult to distinguish one action from another. Conversely, focusing only on the local context makes it challenging to determine the start and end timings of action intervals. To address these challenges, we introduce a one-stage approach named Hierarchical Open-vocab TAD (HOTAD), consisting of two branches: Temporal Context Analysis (TCA) and Video–Text Alignment (VTA). The former utilizes Hierarchical Encoder (HE) to fuse global and local temporal features, enabling a comprehensive capture of temporal actions, while the latter branch exploits the synergy between visual and textual modalities for precisely detecting unseen actions in the Open-vocab setting. Experiments and in-depth analysis using the widely recognized datasets THUMOS14 and ActivityNet-1.3 are performed to show the effectiveness of HOTAD. The results highlight remarkable accuracy in detecting a wide range of unseen actions. Furthermore, HOTAD significantly reduces wrong labels and localizes action instances with high precision, showcasing its robustness in complex and dynamic video settings.
Sports videos contain a large number of irrelevant backgrounds and static frames, which affect the efficiency and accuracy of temporal action detection. To optimize sports video data processing and temporal action detection, an improved multi-level spatiotemporal transformer network model is proposed. The model first optimizes the initial feature extraction of videos through an unsupervised video data preprocessing model based on deep residual networks. Subsequently, multi-scale features are generated through feature pyramid networks. The global spatiotemporal dependencies of actions are captured by a spatiotemporal encoder, and a frame-level self-attention module further extracts keyframes and highlights temporal features, thereby improving detection accuracy. The accuracy of the proposed model was 0.6 at the beginning of training, reached 0.85 after 300 iterations, and peaked at nearly 0.9 after 500 iterations. The mAP of the improved model on the dataset reached 90.5%, higher than the 78.2% of the base model; the recall rate was 92.0%, the precision was 89.5%, and the computation time was 220 ms. Meanwhile, the model shows balanced performance in detecting different types of sports, especially in recognizing complex movements such as gymnastics and diving. The model effectively improves the efficiency and accuracy of temporal action detection through the collaborative action of multiple modules, demonstrating good applicability and robustness.
Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent Structured State-Space Models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. On the other hand, structured state-space models often face two key challenges in TAD, namely, decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, which become more severe while handling long-span action instances. Additionally, traditional methods for TAD struggle with detecting long-span action instances due to a lack of global awareness and inefficient detection heads. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel designs that complement each other with superior TAD performance. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module which effectively facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that refines the detection progressively with multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end one-stage manner using a new state-space temporal adapter (SSTA) which reduces network parameters and computation cost with linear complexity. Extensive experiments show that MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
Temporal action detection aims to locate and classify actions in untrimmed videos. While recent works focus on designing powerful feature processors for pre-trained representations, they often overlook the inherent noise and redundancy within these features. Large-scale pre-trained video encoders tend to introduce background clutter and irrelevant semantics, leading to context confusion and imprecise boundaries. To address this, we propose a frequency-aware decoupling network that improves action discriminability by filtering out noisy semantics captured by pre-trained models. Specifically, we introduce an adaptive temporal decoupling scheme that suppresses irrelevant information while preserving fine-grained atomic action details, yielding more task-specific representations. In addition, we enhance inter-frame modeling by capturing temporal variations to better distinguish actions from background redundancy. Furthermore, we present a long-short-term category-aware relation network that jointly models local transitions and long-range dependencies, improving localization precision. The refined atomic features and frequency-guided dynamics are fed into a standard detection head to produce accurate action predictions. Extensive experiments on THUMOS14, HACS, and ActivityNet-1.3 show that our method, powered by InternVideo2-6B features, achieves state-of-the-art performance on temporal action detection benchmarks.
No abstract available
Recently proposed neural network-based Temporal Action Detection (TAD) models are inherently limited in extracting discriminative representations and modeling action instances of various lengths from complex scenes with shared-weight detection heads. Inspired by the successes of dynamic neural networks, in this paper we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations. Moreover, using DFA helps to develop a Dynamic TAD head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters and learned receptive fields to better detect action instances with diverse ranges in videos. With the proposed encoder layer and DyHead, a new dynamic TAD model, DyFADet, achieves promising performance on a series of challenging TAD benchmarks, including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-Moment QueriesV1.0, and FineAction. Code is released to https://github.com/yangle15/DyFADet-pytorch.
Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection, several methods have adapted the query-based framework to the TAD task. However, these approaches primarily followed DETR to predict actions at the instance level (i.e., identify each action by its center point), leading to sub-optimal boundary localization. To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions from both instance-level and boundary-level. Decoding at different levels requires semantics of different granularity, therefore we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels, facilitating explicit capture of temporal cues and semantics at each level. On top of the two-branch design, we present a joint query initialization strategy to align queries from both levels. Specifically, we leverage encoder proposals to match queries from each level in a one-to-one manner. Then, the matched queries are initialized using position and content prior from the matched action proposal. The aligned dual-level queries can refine the matched proposal with complementary cues during subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD benchmarks. The experimental results demonstrate the superior performance of DualDETR to the existing state-of-the-art methods, achieving a substantial improvement under det-mAP and delivering impressive results under seg-mAP.
Accurate proposal generation is crucial for subsequent classification networks; thus, temporal action proposal generation (TAPG) methods have a significant influence in the field of Temporal Action Detection. The preparation process of supervised TAPG methods is time-consuming and resource-intensive, relying on a large amount of labeled data. Furthermore, due to the relatively small variations in feature sequences at the temporal level in videos, localizing the boundaries of actions is particularly challenging. To address these issues, we first propose a self-supervised pre-training method that designs a Random Query Segment Detection pretext task as the learning objective for pre-training. This enables the training of an action localizer without any annotations. Additionally, when localizing video action segments, the temporal boundaries can be blurred, and the simple feature contrast operation designed during the pre-training process may not effectively distinguish action boundaries. Therefore, this work introduces an improved method, a self-supervised pre-training transformer based on triplets (SSPT-Tr), for triplet-based feature reconstruction to address the aforementioned issue. A negative video segment is added to reconstruct features, and a triplet loss is used to further constrain the boundary feature expression capabilities between action and background. This can effectively enhance the feature discrimination between actions and non-actions. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that the SSPT-Tr method clearly improves performance, not only improving the AR but also shortening the training time of the downstream task. SSPT-Tr combined with UNet also outperforms other methods in the field of Temporal Action Detection in terms of mAP at various tIoU thresholds.
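The triplet constraint mentioned above can be written compactly: an anchor action-segment feature is pulled toward a positive action feature and pushed away from a negative background feature by a margin. The margin, feature sizes, and sampling are illustrative assumptions.

```python
# Sketch of a triplet constraint between action and background segment features:
# anchor and positive are action segments, negative is a background segment.
import torch
import torch.nn.functional as F

def boundary_triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

anchor = torch.randn(8, 256)                      # query action segment features
positive = anchor + 0.1 * torch.randn(8, 256)     # other action segment features
negative = torch.randn(8, 256)                    # background segment features
print(float(boundary_triplet_loss(anchor, positive, negative)))
```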
Temporal action detection (TAD) is a vital challenge in computer vision and the Internet of Things, aiming to detect and identify actions within temporal sequences. While TAD has primarily been associated with video data, its applications can also be extended to sensor data, opening up opportunities for various real-world applications. However, applying existing TAD models to sensory signals presents distinct challenges such as varying sampling rates, intricate pattern structures, and subtle, noise-prone patterns. In response to these challenges, we propose a Sensory Temporal Action Detection (STADe) model. STADe leverages Fourier kernels and adaptive frequency filtering to adaptively capture the nuanced interplay of temporal and frequency features underlying complex patterns. Moreover, STADe embraces adaptability by employing deep fusion at varying resolutions and scales, making it versatile enough to accommodate diverse data characteristics, such as the wide spectrum of sampling rates and action durations encountered in sensory signals. Unlike conventional models with unidirectional category-to-proposal dependencies, STADe adopts a cross-cascade predictor to introduce bidirectional and temporal dependencies within categories. To extensively evaluate STADe and promote future research in sensory TAD, we establish three diverse datasets using various sensors, featuring diverse sensor types, action categories, and sampling rates. Experiments across one public and our three new datasets demonstrate STADe’s superior performance over state-of-the-art TAD models in sensory TAD tasks.
Temporal action detection aims to predict temporal boundaries and category labels of actions in untrimmed videos. In the past years, many weakly supervised temporal action detection methods have been proposed to relieve the annotation cost of fully supervised methods. Due to the discrepancy between action localization and action classification, the two-branch structure is widely adopted by existing weakly supervised methods, where the classification branch is used to predict category-wise score and the localization branch is used to predict foreground score for each segment. Under the weakly supervised setting, the model training is mainly guided by the video-level or sparse segment-level annotations. As a result, the classification branch tends to focus on the most discriminative segments while ignoring less discriminative ones so as to minimize the classification cost, and the localization branch may assign high foreground scores to some negative segments. This phenomenon can severely damage the action detection performance, because the foreground scores and classification scores are combined together in the testing stage for action detection. To deal with this problem, several methods have been proposed to encourage the consistency between the classification branch and localization branch. However, these methods only consider the video-level or segment-level consistency, without considering the relation among different segments to be consistent. In this paper, we propose a Cross-Task Relation-Aware Consistency (CRC) strategy for weakly supervised temporal action detection, including an intra-video consistency module and an inter-video consistency module. The intra-video consistency module can well guarantee the relationship among segments from the same video to be consistent, and the inter-video consistency module guarantees the relationship among segments from different videos to be consistent. These two modules are complementary to each other by combining both intra-video and inter-video consistency. Experimental results show that the proposed CRC strategy can consistently improve the performance of existing weakly supervised methods, including click-level supervised methods (e.g., LACP Lee et al., 2021), video-level supervised methods (e.g., DELU Chen et al., 2022) and unsupervised methods (e.g., BaS-Net Lee et al., 2020), verifying the generality and effectiveness of the proposed method.
Temporal action detection (TAD) is a critical task in video understanding. Nevertheless, most existing closed-set TAD methods often struggle to replicate their high performance when completely unseen or unknown actions emerge in an open-world test environment. To this end, the open-set temporal action detection (OSTAD) task has been recently proposed to relax the closed-set TAD condition to the unknown-aware open-set detection. Given only a limited number of known action classes available in model training, precisely localizing and rejecting the unknown action instances is extremely difficult and requires strong model generalization abilities. However, existing approaches are yet far from optimal in discriminative action feature learning and prediction uncertainty estimation, which may hamper model generalization to unknown action detection. To address these issues, this paper proposes a novel Causal and Evidential Open-set Temporal Action Detection model named CEO-TAD for improved OSTAD performance. It accomplishes expressive video feature pyramid extraction, discriminative causal action feature representation learning, and reliable EDL-based prediction uncertainty estimation with our tailored network architectures and modified loss functions. Experimental results show that our proposed method achieves state-of-the-art open-set temporal action detection performance on the THUMOS14 and ActivityNet1.3 benchmarks. Ablation studies verify the effectiveness of the proposed model components.
No abstract available
Temporal action detection is a key task in video understanding, with one major challenge being the handling of confounders. Confounders include both observed factors (e.g., temporal order, co-occurrence patterns of actions) and unobserved factors (e.g., lighting, individual states), which can introduce bias and affect predictions. While causal inference methods have been introduced, they often rely on fixed representations of confounders, limiting their adaptability to dynamic contexts, particularly with unobserved confounders. To address this, we propose TAD-IVR, which combines Transformer with instrumental variable (IV) regression. Transformer flexibly captures the temporal dependencies of actions, improving the representation of confounders, while IV regression uses exogenous variables to eliminate the influence of unobserved confounders, thus reducing prediction bias. Additionally, we introduce mutual information constraints and zero-sum optimization strategies to enforce more informative and accurate feature representations. Experimental results show that TAD-IVR effectively mitigates confounding effects and improves detection accuracy.
Temporal Action Detection (TAD) is a fundamental task in video understanding that aims to identify and localize action instances within videos. Although recent methods have achieved remarkable progress, they are built upon various combinations of temporal backbones and pre-trained features, making it difficult to assess the true effectiveness of each component. To address this, we conduct a systematic study of these combinations. Our analysis reveals that Transformer-based pre-trained features already provide sufficient global context, rendering additional global modeling in the backbone redundant. Instead, performance significantly improves when these global features are complemented by dedicated local temporal modeling. Motivated by this insight, we propose Local Temporal Mamba (LTMamba), which preserves the rich global context from pre-trained features while integrating Local Mamba blocks into the temporal backbone. These blocks excel at efficiently modeling complex local dependencies within variable temporal windows, enabling the model to effectively exploit both global and local information. To validate the effectiveness of this design, we demonstrate that LTMamba outperforms state-of-the-art methods that rely on global modeling in both the pre-trained features and the temporal backbone, achieving 73.7% mAP on THUMOS14 (+1.0) and 42.4% mAP on ActivityNet (+0.4).
In this paper, we find that normalized coordinate expression is a key factor behind the reliance on hand-crafted components in query-based detectors for temporal action detection (TAD). Despite significant advancements towards an end-to-end framework in object detection, query-based detectors have been limited in achieving full end-to-end modeling in TAD. To address this issue, we propose TE-TAD, a fully end-to-end temporal action detection transformer that integrates time-aligned coordinate expression. We reformulate coordinate expression utilizing actual timeline values, ensuring length-invariant representations across the extremely diverse range of video durations. Furthermore, our proposed adaptive query selection dynamically adjusts the number of queries based on video length, providing a suitable solution for varying video durations compared to a fixed query set. Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors. Our TE-TAD outperforms previous query-based detectors and achieves competitive performance compared to state-of-the-art methods on popular benchmark datasets. Code is available at: https://github.com/Dotori-HJ/TE-TAD
No abstract available
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Despite that they focus on different events, we observe they have a significant connection. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we aim to investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification score and temporal segments. Secondly, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, in order to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform the separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significant detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), which is a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design, we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, being the first end-to-end model to outperform the best feature-based methods. Code is available at https://github.com/sming256/AdaTAD.
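A minimal sketch of the adapter idea, assuming a frozen backbone whose intermediate tokens have shape (B, T, C); the module name, bottleneck width, and kernel size are illustrative, not the AdaTAD implementation:

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Lightweight adapter sketch: a bottleneck with a depthwise temporal
    convolution, added residually so only the adapter needs gradients while
    the wrapped backbone block stays frozen."""
    def __init__(self, dim, bottleneck=64, kernel=3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=kernel // 2, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping (residual = 0)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (B, T, C) frame tokens from the backbone
        h = self.down(x)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)  # aggregate adjacent frames
        return x + self.up(torch.relu(h))
```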
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries, similar to DETR, which has shown great success in object detection. However, the framework suffers from several problems if directly applied to TAD: the insufficient exploration of inter-query relation in the decoder, the inadequate classification training due to a limited number of training samples, and the unreliable classification scores at inference. To this end, we first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations. Moreover, we propose two losses to facilitate and stabilize the training of action classification. Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries. The proposed method, named ReAct, achieves the state-of-the-art performance on THUMOS14, with much lower computational costs than previous methods. Besides, extensive ablation studies are conducted to verify the effectiveness of each proposed component. The code is available at https://github.com/sssste/React.
Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pretrained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance, TAD models tend to yield the largest performance drop. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, through the Frame-Drop augmentation and Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.
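The Frame-Drop augmentation and the consistency objective could look roughly like the sketch below; the function names and the MSE-based consistency term are assumptions, and the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def frame_drop(features, drop_prob=0.1):
    """Randomly zero out a few frames to simulate temporal corruption.
    features: (B, T, C) snippet features."""
    keep = (torch.rand(features.shape[:2], device=features.device) > drop_prob).float()
    return features * keep.unsqueeze(-1)

def temporal_robust_consistency(model, features):
    """Encourage predictions on clean and frame-dropped inputs to agree."""
    with torch.no_grad():
        clean = model(features)                 # (B, T, num_classes) predictions, assumed
    corrupted = model(frame_drop(features))
    return F.mse_loss(corrupted, clean)
```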
Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, limited to recognizing previously seen classes alone during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging with significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP-style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors. The PyTorch implementation of STALE is available at https://github.com/sauradip/STALE.
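The CLIP-style snippet classification underlying such zero-shot pipelines can be sketched as cosine similarity between snippet embeddings and class-name text embeddings; the pre-extracted features, shared embedding space, and temperature value below are assumptions:

```python
import torch
import torch.nn.functional as F

def zero_shot_snippet_scores(snippet_feats, class_text_feats, tau=0.01):
    """CLIP-style zero-shot snippet classification sketch: cosine similarity
    between visual snippet embeddings and class-name text embeddings,
    both assumed to live in a shared embedding space."""
    v = F.normalize(snippet_feats, dim=-1)      # (T, D) visual snippet embeddings
    t = F.normalize(class_text_feats, dim=-1)   # (K, D) one embedding per class prompt
    return F.softmax(v @ t.t() / tau, dim=-1)   # (T, K) per-snippet class probabilities
```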
Zero-shot temporal action detection (ZS-TAD), aiming to recognize and detect new and unseen video actions, is an emerging and challenging task with limited solutions. Recent studies have adapted the vision-language pre-trained model CLIP for this task in a parameter-efficient fine-tuning fashion to achieve open-vocabulary detection. However, they suffer from insufficient vision-text alignment because of the dual-stream structure of CLIP and yield inferior TAD results due to the lack of accurate action prior. In this paper, we target the above limitations and propose to learn multimodal Prompts and Text-Enhanced Actionness (mProTEA) for ZS-TAD. Specifically, we insert learnable layer-wise prompts into the vision and text branches of the frozen CLIP and establish a strong coupling between them, resulting in multimodal prompts that can boost cross-modal alignment. To ease computation costs, we propose to conduct multimodal prompt learning on an image recognition dataset with rich concepts (e.g., ImageNet) first and then keep them frozen during TAD fine-tuning. For improving TAD, we introduce text-enhanced actionness modeling, where we leverage the concise semantics of text to assist the calculation of class-agnostic actionness scores, to offer accurate prior information for both action classification and localization. With the above designs, our mProTEA excels in extensive TAD experiments, surpassing the strong competitor STALE by 5.1% on ActivityNet under the zero-shot setting and achieving state-of-the-art performance in conventional supervised scenarios. Ablation studies confirm the effectiveness of our proposals and show superior domain generalization of multimodal prompts learned on ImageNet against the other 10 image recognition datasets.
Action recognition and localization in complex, untrimmed videos remain a formidable challenge in computer vision, largely due to the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that leverages the advanced semantic understanding capabilities of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT employs a dual-branch design, comprising a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter utilizes an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for macro-level narrative and Temporal Sub-event Prompting for fine-grained action details. These multi-level textual cues are integrated into the video encoder's learning process through sophisticated mechanisms such as adaptive gating for high-level semantic fusion, cross-modal attention for fine-grained feature refinement, and an event graph module for temporal context calibration. Trained end-to-end with a comprehensive loss function incorporating semantic consistency and temporal calibration terms, ECVT significantly enhances the model's ability to understand video temporal structures and event logic. Extensive experiments on ActivityNet v1.3 and THUMOS14 datasets demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14, outperforming leading baselines.
Weakly-Supervised Temporal Action Localization is a very challenging task of classifying and localizing all actions in an untrimmed video, because frame-wise labels are not given during training and the only supervision is the video-level action class. Due to the complexity of video structure, previous methods do not take advantage of the context information among long-term action-related frames. In this paper, we propose a Global Context Relation Network which introduces the self-attention mechanism. The first part uses the context relation module to encode features according to the relationships of the global context and merge them into the original features, which allows the network to better capture long-term dependencies in the video. Then local feature encoding is performed by convolution to obtain a more accurate class activation sequence. Extensive experiments on two benchmark datasets, THUMOS14 and ActivityNet1.3, demonstrate that our method outperforms existing state-of-the-art results on THUMOS14 and achieves very comparable performance on ActivityNet1.3.
No abstract available
We present an efficient approach for temporal action co-localization (TACL), which means to simultaneously localize all action instances in an untrimmed video. Compared with the conventional instance-by-instance action localization, TACL can exploit the contextual and temporal relationships among action instances to reduce the localization ambiguities. Motivated by the strong relational modeling capability of graph neural networks, we propose a Graph-based Temporal Action Co-Localization (G-TACL) method. By considering each action proposal as a node, G-TACL effectively aggregates contextual and temporal features from related action proposals to jointly recognize and localize all action instances in a single shot. Moreover, we introduce a novel multi-level consistency evaluator to measure the relatedness between any two action proposals. This is achieved by considering their high-level contextual similarities, low-level temporal coincidences and feature correlations. We exploit the Gated Recurrent Units (GRUs) to iteratively update the features of each node, which are then used to regress the temporal boundaries of action proposals and finally achieve action co-localization. Experimental results on three datasets, i.e., THUMOS14, MEXaction2 and ActivityNet v1.3, demonstrate that our G-TACL is superior or comparable to the state-of-the-art.
Weakly-supervised temporal action localization aims to identify action instances using only video-level labels and localize their positions in untrimmed videos. Due to the temporal continuity of video data, most methods that use a single-scale convolution kernel cannot effectively model the characteristics of video data, leading to a decrease in accuracy. However, simply using multi-scale features can introduce redundant information and noise, reducing model efficiency while also affecting the accurate judgement of the model during the training process. To alleviate this problem, a video complicated-information extraction and filtering network (VCEF-Net) is proposed. It contains two main modules. The first, a multi-scale feature extraction module, is developed to enrich the information the model receives. The second, a pseudo-label filtering module, inhibits the interference of redundant information. VCEF-Net introduces these two modules to better utilize video information. Experiments on THUMOS14 and ActivityNet1.2 demonstrate the better performance of the proposed VCEF-Net and validate its effectiveness.
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually, thereby exploiting only limited contextual information. As a result, the model will lack a comprehensive understanding (e.g. appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances via weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning more structured and compact embedding space, thus reducing ambiguity in classification learning. Further, the GKSA module is used to efficiently summarize and propagate the cross-video representative action knowledge in a learnable manner to promote holistic action patterns understanding, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
Temporal action localization (TAL) in untrimmed videos recently emerged as a crucial research topic, which has been applied in various applications such as surveillance, crowd monitoring, and driver distraction recognition. Most modern approaches in TAL divide this problem into two parts: i) feature extraction for action recognition; and ii) temporal boundary for action localization. In this study, we focus on improving the performance of the TAL task by exploiting the feature extraction effectively. Specifically, we present a temporal triplet algorithm in order to enhance temporal density-dependence information for the input video clips. Moreover, the multiview fusion framework is taken into account for enriching action representation. For the evaluation, we conduct the proposed method on the 2023 AI City Challenge Dataset. Accordingly, our method achieves competitive results and belongs to the top public leaderboard in Track 3 of the Challenge.
Weakly-supervised temporal action localization aims to localize actions in untrimmed videos with only video-level labels. Most existing methods address this problem with a “localization-by-classification” pipeline that localizes action regions based on snippet-wise classification sequences. Snippet-wise classifications are unfortunately error prone due to the sparsity of video-level labels. Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting. This is enabled by three key designs: 1) an effective pseudo-label denoising module to alleviate the side effects caused by noisy contrastive features, 2) an efficient region-level feature contrast strategy with a region-level memory bank to capture “global” contrast across the entire dataset, and 3) a diverse contrastive learning strategy to enable action-background separation as well as intra-class compactness & inter-class separability. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate the superior performance of our approach.
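A rough sketch of a class-wise, region-level memory bank used for cross-video contrast; the class-indexed queue structure, queue size, and multi-positive InfoNCE form are assumptions rather than the paper's code:

```python
import torch
import torch.nn.functional as F

class RegionMemoryBank:
    """Class-wise region-level memory bank sketch for cross-video contrast."""
    def __init__(self, num_classes, dim, size=128):
        self.bank = F.normalize(torch.randn(num_classes, size, dim), dim=-1)
        self.ptr = torch.zeros(num_classes, dtype=torch.long)
        self.size = size

    def enqueue(self, cls_id, feat):
        i = int(self.ptr[cls_id])
        self.bank[cls_id, i] = F.normalize(feat.detach(), dim=-1)
        self.ptr[cls_id] = (i + 1) % self.size

    def contrast(self, feats, cls_id, tau=0.07):
        """Multi-positive InfoNCE: same-class bank entries are positives,
        all other-class entries are negatives. feats: (N, dim)."""
        feats = F.normalize(feats, dim=-1)
        pos = self.bank[cls_id]                                    # (size, dim)
        mask = torch.arange(self.bank.size(0)) != cls_id
        neg = self.bank[mask].reshape(-1, feats.size(-1))          # ((C-1)*size, dim)
        logits = torch.cat([feats @ pos.t(), feats @ neg.t()], dim=-1) / tau
        targets = torch.zeros_like(logits)
        targets[:, :pos.size(0)] = 1.0 / pos.size(0)
        return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```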
Weakly-supervised temporal action localization aims to identify and localize the action instances in the untrimmed videos with only video-level action labels. When humans watch videos, we can adapt our abstract-level knowledge about actions in different video scenarios and detect whether some actions are occurring. In this paper, we mimic how humans do and bring a new perspective for locating and identifying multiple actions in a video. We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video. The learned queries not only contain the actions' knowledge features at the abstract level but also have the ability to fit this knowledge into the target video scenario, and they will be used to detect the presence of the corresponding action along the temporal dimension. To better learn these action category queries, we exploit not only the features of the current input video but also the correlation between different videos through a novel video-specific action category query learner worked with a query similarity loss. Finally, we conduct extensive experiments on three commonly used datasets (THUMOS14, ActivityNet1.2, and ActivityNet1.3) and achieve state-of-the-art performance.
Temporal action localization (TAL) is an important and challenging problem in video understanding. However, most existing TAL benchmarks are built upon the coarse granularity of action classes, which exhibits two major limitations in this task. First, coarse-level actions can make the localization models overfit in high-level context information, and ignore the atomic action details in the video. Second, the coarse action classes often lead to the ambiguous annotations of temporal boundaries, which are inappropriate for temporal action localization. To tackle these problems, we develop a novel large-scale and fine-grained video dataset, coined as FineAction, for temporal action localization. In total, FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. Compared to the existing TAL datasets, our FineAction takes distinct characteristics of fine action classes with rich diversity, dense annotations of multiple instances, and co-occurring actions of different classes, which introduces new opportunities and challenges for temporal action localization. To benchmark FineAction, we systematically investigate the performance of several popular temporal localization methods on it, and deeply analyze the influence of fine-grained instances in temporal action localization. As a minor contribution, we present a simple baseline approach for handling the fine-grained action detection, which achieves an mAP of 13.17% on our FineAction. We believe that FineAction can advance research of temporal action localization and beyond. The dataset is available at https://deeperaction.github.io/datasets/fineaction.
With video-level labels, weakly supervised temporal action localization (WTAL) applies a localization-by-classification paradigm to detect and classify the action in untrimmed videos. Due to the characteristic of classification, class-specific background snippets are inevitably mis-activated to improve the discriminability of the classifier in WTAL. To alleviate the disturbance of background, existing methods try to enlarge the discrepancy between action and background through modeling background snippets with pseudo-snippet-level annotations, which largely rely on hand-crafted assumptions. Distinct from previous works, we present an adversarial learning strategy to break the limitation of mining pseudo background snippets. Concretely, the background classification loss forces the whole video to be regarded as background by a background gradient reinforcement strategy, confusing the recognition model. Conversely, the foreground (action) loss guides the model to focus on action snippets under such conditions. As a result, competition between the two classification losses drives the model to boost its ability for action modeling. Simultaneously, a novel temporal enhancement network is designed to facilitate the model to construct temporal relations among affinity snippets based on the proposed strategy, further improving the performance of action localization. Finally, extensive experiments conducted on THUMOS14 and ActivityNet1.2 demonstrate the effectiveness of the proposed method.
We propose a novel method of exploiting informative video segments by learning segment weights for temporal action localization in untrimmed videos. Informative video segments represent the intrinsic motion and appearance of an action, and thus contribute crucially to action localization. The learned segment weights represent the informativeness of video segments to recognize actions and help infer the boundaries required to temporally localize actions. We build a supervised temporal attention network (STAN) that includes a supervised segment-level attention module to dynamically learn the weights of video segments, and a feature-level attention module to effectively fuse multiple features of segments. Through the cascade of the attention modules, STAN exploits informative video segments and generates descriptive and discriminative video representations. We use a proposal generator and a classifier to estimate the boundaries of actions and classify the classes of actions. Extensive experiments are conducted on two public benchmarks, i.e., THUMOS2014 and ActivityNet1.3. The results demonstrate that our proposed method achieves competitive performance compared with existing state-of-the-art methods. Moreover, compared with the baseline method that treats video segments equally, STAN achieves significant improvements with an increase of the mean average precision from 30.4% to 39.8% on the THUMOS2014 dataset, and from 31.4% to 35.9% on the ActivityNet1.3 dataset, demonstrating the effectiveness of learning informative video segments for temporal action localization.
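The segment-level attention weighting can be illustrated with a small pooling module; the names, shapes, and softmax normalization are assumed for the sketch and are not the STAN implementation:

```python
import torch
import torch.nn as nn

class SegmentAttentionPool(nn.Module):
    """Segment-level attention sketch: learn an informativeness weight per
    segment and pool segment features with those weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, segments):                               # segments: (B, N, C)
        weights = torch.softmax(self.score(segments), dim=1)   # (B, N, 1) informativeness
        video_repr = (weights * segments).sum(dim=1)           # (B, C) weighted representation
        return video_repr, weights.squeeze(-1)
```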
Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or "fuzzy" temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty. Within the paradigm of Transformer-based architectures, TBT-Former advances the formidable benchmark set by its predecessors, establishing a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at https://github.com/aaivu/In21-S7-CS4681-AML-Research-Projects/tree/main/projects/210536K-Multi-Modal-Learning_Video-Understanding
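A GFL-style boundary distribution head can be sketched as predicting a discrete distribution over start/end offsets and taking its expectation; the bin count and offset range below are assumed values, not TBT-Former's configuration:

```python
import torch
import torch.nn as nn

class BoundaryDistributionHead(nn.Module):
    """GFL-style boundary head sketch: predict a discrete distribution over
    start/end offsets and take its expectation as the regressed boundary."""
    def __init__(self, dim, num_bins=16, max_offset=64.0):
        super().__init__()
        self.logits = nn.Linear(dim, 2 * num_bins)                 # start and end distributions
        self.register_buffer("bins", torch.linspace(0, max_offset, num_bins))
        self.num_bins = num_bins

    def forward(self, feat):                                       # feat: (B, T, C)
        p = self.logits(feat).view(*feat.shape[:-1], 2, self.num_bins).softmax(-1)
        offsets = (p * self.bins).sum(-1)                          # (B, T, 2) expected offsets
        return offsets, p                                          # p exposes boundary uncertainty
```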
Temporal action localization (TAL) has drawn much attention in recent years, however, the performance of previous methods is still far from satisfactory due to the lack of annotated untrimmed video data. To deal with this issue, we propose to improve the utilization of current data through feature augmentation. Given an input video, we first extract video features with pre-trained video encoders, and then randomly mask various semantic contents of video features to consider different views of video features. To avoid damaging important action-related semantic information, we further develop a learnable feature augmentation framework to generate better views of videos. In particular, a Mask-based Feature Augmentation Module (MFAM) is proposed. The MFAM has three advantages: 1) it captures the temporal and semantic relationships of original video features, 2) it generates masked features with indispensable action-related information, and 3) it randomly recycles some masked information to ensure diversity. Finally, we input the masked features and the original features into shared action detectors respectively, and perform action classification and localization jointly for model learning. The proposed framework can improve the robustness and generalization of action detectors by learning more and better views of videos. In the testing stage, the MFAM can be removed, which does not bring extra computational costs. Extensive experiments are conducted on four TAL benchmark datasets. Our proposed framework significantly improves different TAL models and achieves the state-of-the-art performances.
Naturalistic driving studies with computer vision techniques have become an emerging research topic. The objective is to classify distracted driver behaviors. Specifically, this issue is regarded as temporal action localization (TAL) in untrimmed videos, which is a challenging task in the research field of video analysis. In particular, TAL remains one of the most challenging unsolved problems in computer vision, requiring not only the recognition of actions but also the localization of the start and end times of each action. Most state-of-the-art approaches adopt complex architectures, which are expensive to train and have inefficient inference times. In this study, we propose a new framework for untrimmed naturalistic driving videos that utilizes the results of 3D action recognition with video clip classification for short-range temporal and spatial correlation. Then, a simple data-driven post-processing step is presented for long-range temporal correlation in untrimmed videos. The proposed method is evaluated on the AI City Challenge 2022 dataset for Naturalistic Driving Action Recognition. Accordingly, our method achieves the top-1 result on the public leaderboard of the challenge.
No abstract available
Detecting actions in videos has been widely applied in on-device applications, such as cars, robots, etc. Practical on-device videos are always untrimmed with both action and background. It is desirable for a model to both recognize the class of action and localize the temporal position where the action happens. Such a task is called temporal action localization (TAL), which is usually trained on the cloud where multiple untrimmed videos are collected and labeled. It is desirable for a TAL model to continuously and locally learn from new data, which can directly improve the action detection precision while protecting customers' privacy. However, directly training a TAL model on the device is nontrivial. To train a TAL model that can precisely recognize and localize each action, a tremendous number of video samples with temporal annotations are required. However, annotating videos frame by frame is exorbitantly time consuming and expensive. Although weakly supervised temporal action localization (W-TAL) has been proposed to learn from untrimmed videos with only video-level labels, such an approach is also not suitable for on-device learning scenarios. In practical on-device learning applications, data are collected as a stream. For example, the camera on the device keeps collecting video frames for hours or days, and the actions of nearly all classes are included in a single long video stream. Dividing such a long video stream into multiple video segments requires lots of human effort, which hinders the exploration of applying TAL tasks to realistic on-device learning applications. To enable W-TAL models to learn from a long, untrimmed streaming video, we propose an efficient video learning approach that can directly adapt to new environments. We first propose a self-adaptive video dividing approach with a contrast score-based segment merging approach to convert the video stream into multiple segments. Then, we explore different sampling strategies on the TAL tasks to request as few labels as possible. To the best of our knowledge, this is the first attempt to directly learn from an on-device, long video stream. Experimental results on the THUMOS'14 dataset show that the performance of our approach is comparable to the current W-TAL state-of-the-art (SOTA) work without any laborious manual video splitting.
Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder -- trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity end-to-end (LoFi) video encoder pre-training method. Instead of always using the full training configurations for TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial or spatio-temporal resolution so that end-to-end optimization for the video encoder becomes operable under the memory conditions of a mid-range hardware budget. Crucially, this enables the gradient to flow backward through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi pre-training approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18 based video encoder in a single RGB stream, our method surpasses two-stream ResNet50 based alternatives with expensive optical flow, often by a good margin.
Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
Temporal action localization (TAL), which aims to identify and localize actions in long untrimmed videos, is a challenging task in video understanding. Recent studies have shown that the Transformer and its variants are effective at improving the performance of TAL. The success of the Transformer can be attributed to the use of multi-head self-attention (MHSA) as a token mixer to capture long-term temporal dependencies within the video sequence. However, in the existing Transformer architecture, the features obtained by multiple token mixing (i.e., self-attention) heads are treated equally, which neglects the distinct characteristics of different heads and hampers the exploitation of discriminative information. To this end, we present a new method called the adaptive dual selective Transformer (ADSFormer) for TAL in this paper. The key component in ADSFormer is the dual selective multi-head token mixer (DSMHTM), which integrates multiple feature representations from different token mixing heads by adaptively selecting important features across both the head and channel dimensions. Moreover, we also incorporate our ADSFormer into a pyramid structure so that the multi-scale features obtained can be effectively combined to improve TAL performance. Benefiting from the dual selective multi-head token mixer (DSMHTM) and pyramid feature combination, ADSFormer outperforms several state-of-the-art methods on four challenging benchmark datasets: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100 and ActivityNet-1.3.
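One way to read the dual (head and channel) selection is gating the per-head outputs of the token mixer with learned per-head and per-channel weights before merging them, as in the sketch below; this is an interpretation of the idea with assumed shapes, not the ADSFormer code:

```python
import torch
import torch.nn as nn

class DualSelectiveFusion(nn.Module):
    """Dual selection sketch: gate the per-head outputs of a token mixer with
    learned per-head and per-channel weights before merging them."""
    def __init__(self, num_heads, head_dim):
        super().__init__()
        dim = num_heads * head_dim
        self.head_gate = nn.Sequential(nn.Linear(dim, num_heads), nn.Sigmoid())
        self.chan_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, head_outputs):              # (B, T, H, D) per-head features
        B, T, H, D = head_outputs.shape
        pooled = head_outputs.reshape(B, T, H * D).mean(dim=1)   # (B, H*D) global descriptor
        hg = self.head_gate(pooled).view(B, 1, H, 1)             # select important heads
        cg = self.chan_gate(pooled).view(B, 1, H, D)             # select important channels
        return (head_outputs * hg * cg).reshape(B, T, H * D)
```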
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.
Weakly-Supervised Temporal Action Localization (WS-TAL) aims to jointly localize and classify action segments in untrimmed videos with only video-level annotations. To leverage video-level annotations, most existing methods adopt the multiple instance learning paradigm where frame-/snippet-level action predictions are first produced and then aggregated to form a video-level prediction. Although there are trials to improve snippet-level predictions by modeling temporal relationships, we argue that those implementations have not sufficiently exploited such information. In this paper, we propose Multi-Modal Plateau Transformers (M2PT) for WS-TAL by simultaneously exploiting temporal relationships among snippets, complementary information across data modalities, and temporal coherence among consecutive snippets. Specifically, M2PT explores a dual-Transformer architecture for RGB and optical flow modalities, which models intra-modality temporal relationship with a self-attention mechanism and inter-modality temporal relationship with a cross-attention mechanism. To capture the temporal coherence that consecutive snippets are supposed to be assigned with the same action, M2PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that our proposed M2PT achieves state-of-the-art performance.
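The inter-modality cross-attention between RGB and optical-flow snippets can be sketched with plain nn.MultiheadAttention layers; this is a minimal stand-in under assumed shapes, not the M2PT dual-Transformer design:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Inter-modality attention sketch: RGB snippets attend to optical-flow
    snippets and vice versa, with residual connections."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.rgb_to_flow = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, flow):                        # both (B, T, C)
        rgb_out, _ = self.rgb_to_flow(rgb, flow, flow)   # RGB queries, flow keys/values
        flow_out, _ = self.flow_to_rgb(flow, rgb, rgb)
        return rgb + rgb_out, flow + flow_out
```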
Weakly supervised temporal action localization (TAL) aims to localize the action instances in untrimmed videos using only video-level action labels. Without snippet-level labels, it is hard to assign accurate action/background categories to all snippets. The main difficulties are the large variations brought by unconstrained background snippets and the multiple subactions within action snippets. The existing prototype model focuses on describing snippets by covering them with clusters (defined as prototypes). In this work, we argue that clustered prototypes covering snippets with simple variations still suffer from the misclassification of snippets with large variations. We propose an ensemble prototype network (EPNet), which ensembles prototypes learned with consensus-aware clustering. The network stacks a consensus prototype learning (CPL) module and an ensemble snippet weight learning (ESWL) module as one stage and extends one stage to multiple stages in an ensemble learning manner. The CPL module learns the consensus matrix by estimating the similarity of clustering labels between two successive clustering generations. The consensus matrix optimizes the clustering to learn consensus prototypes, which can predict the snippets with consensus labels. The ESWL module estimates the weights of the misclassified snippets using the snippet-level loss. The weights update the posterior probabilities of the snippets in the clustering to learn prototypes in the next stage. We use multiple stages to learn multiple prototypes, which can cover the snippets with large variations for accurate snippet classification. Extensive experiments show that our method achieves state-of-the-art performance among weakly supervised TAL methods on the THUMOS’14, ActivityNet v1.2, and ActivityNet v1.3 datasets.
Weakly-supervised Temporal Action Localization (W-TAL) aims to train a model to localize all action instances potentially from different classes in an untrimmed video, using a training dataset that has video-level action class labels but has no detailed annotations on the start and end timestamps of action instances. We propose to solve the W-TAL problem from the feature learning aspect, with a new architecture, termed F3-Net, which includes (1) a Feature Weakening (FW) module that can identify and randomly weaken either the most discriminative action or the most discriminative background features over the training iterations to force the network to precisely localize the action instances in both discriminative and ambiguous action-related frames, without spreading to the background intervals; (2) a Feature Contextualization (FC) module that can infer the global contexts among video segments and attentionally fuse them with the local contexts from individual video segments to generate more representative features; and (3) a Feature Discrimination (FD) module that can highlight the most discriminative video segments/classes corresponding to each class/segment, respectively, for localizing multiple action instances from different classes within a video. Experimental results on THUMOS14 and ActivityNet1.3 demonstrate the state-of-the-art performance of our F3-Net, and the FW and FC are also effective plug-in modules to improve other methods. This project will be available at https://moniruzzamanmd.github.io/F3-Net/
Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Gated Fusion that strategically combines the output of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2 (ViT-g) and leveraging them beyond head-only transfer learning.
To deal with the great number of untrimmed videos produced every day, we propose an efficient unsupervised action segmentation method by detecting boundaries, named action boundary detection (ABD). In particular, the proposed method has the following advantages: no training stage and low-latency inference. To detect action boundaries, we estimate the similarities across smoothed frames, which inherently have the properties of internal consistency within actions and external discrepancy across actions. Under this circumstance, we successfully transfer the boundary detection task into the change point detection based on the similarity. Then, non-maximum suppression (NMS) is conducted in local windows to select the smallest points as candidate boundaries. In addition, a clustering algorithm is followed to refine the initial proposals. Moreover, we also extend ABD to the online setting, which enables real-time action segmentation in long untrimmed videos. By evaluating on four challenging datasets, our method achieves state-of-the-art performance. Moreover, thanks to the efficiency of ABD, we achieve the best trade-off between the accuracy and the inference time compared with existing unsupervised approaches.
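The training-free boundary detection recipe described above (smooth frames, measure adjacent-frame similarity, keep local similarity minima under NMS) can be sketched as follows; the window sizes and smoothing choice are illustrative assumptions:

```python
import numpy as np

def action_boundary_detection(features, window=5, nms_radius=8):
    """Training-free boundary detection sketch: smooth frame features, measure
    similarity between consecutive frames, and keep local similarity minima
    (change points) as candidate action boundaries. features: (T, C)."""
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, features)
    a, b = smoothed[:-1], smoothed[1:]
    sim = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    boundaries = []
    for t in range(len(sim)):                      # NMS: keep the smallest point in each window
        lo, hi = max(0, t - nms_radius), min(len(sim), t + nms_radius + 1)
        if sim[t] == sim[lo:hi].min():
            boundaries.append(t + 1)
    return boundaries
```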
It is challenging to generate temporal action proposals from untrimmed videos. In general, boundary-based temporal action proposal generators are based on detecting temporal action boundaries, where a classifier is usually applied to evaluate the probability of each temporal action location. However, most existing approaches treat boundaries and contents separately, which neglect that the context of actions and the temporal locations complement each other, resulting in incomplete modeling of boundaries and contents. In addition, temporal boundaries are often located by exploiting either local clues or global information, without mining local temporal information and temporal-to-temporal relations sufficiently at different levels. Facing these challenges, a novel approach named multi-level content-aware boundary detection (MCBD) is proposed to generate temporal action proposals from videos, which jointly models the boundaries and contents of actions and captures multi-level (i.e., frame level and proposal level) temporal and context information. Specifically, the proposed MCBD preliminarily mines rich frame-level features to generate one-dimensional probability sequences, and further exploits temporal-to-temporal proposal-level relations to produce two-dimensional probability maps. The final temporal action proposals are obtained by a fusion of the multi-level boundary and content probabilities, achieving precise boundaries and reliable confidence of proposals. The extensive experiments on the three benchmark datasets of THUMOS14, ActivityNet v1.3 and HACS demonstrate the effectiveness of the proposed MCBD compared to state-of-the-art methods. The source code of this work can be found in https://mic.tongji.edu.cn.
Temporal action segmentation is typically achieved by discovering the dramatic variances in global visual descriptors. In this paper, we explore the merits of local features by proposing the unsupervised framework of Object-centric Temporal Action Segmentation (OTAS). Broadly speaking, OTAS consists of self-supervised global and local feature extraction modules as well as a boundary selection module that fuses the features and detects salient boundaries for action segmentation. As a second contribution, we discuss the pros and cons of existing frame-level and boundary-level evaluation metrics. Through extensive experiments, we find OTAS is superior to the previous state-of-the-art method by 41% on average in terms of our recommended F1 score. Surprisingly, OTAS even outperforms the ground-truth human annotations in the user study. Moreover, OTAS is efficient enough to allow real-time inference.
The Online Detection of Action Start (ODAS) has attracted the attention of researchers because of its practical applications in areas such as security and emergency response. However, online detection of activity boundaries remains a challenging task due to the inherent ambiguity of boundary definition and the significant imbalance in the number of boundaries and nonboundary points. To address this issue, this study proposes a novel Distribution-aware Activity Boundary Representation (DABR) method that utilizes a continuous probability density function to smooth the probability of moments near activity boundaries. The proposed DABR reduces the penalty for detecting moments near ground-truth boundary points, while increasing the number of samples related to boundary points. Additionally, we introduce a two-stage framework that incorporates class-informed information in temporal localization for more efficient activity boundary localization. Extensive experiments demonstrate that our method achieves state-of-the-art results on two standard datasets, particularly exhibiting a significant improvement of 11.5% at average p-mAP on the THUMOS'14 dataset.
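The distribution-aware boundary representation can be approximated by replacing hard 0/1 boundary labels with a Gaussian-smoothed target; sigma is an assumed hyper-parameter and the paper's probability density function may differ:

```python
import numpy as np

def soft_boundary_targets(num_frames, boundary_frames, sigma=2.0):
    """Distribution-aware boundary target sketch: replace hard 0/1 labels at
    ground-truth boundaries with a Gaussian that spreads probability mass over
    nearby frames, reducing the penalty for near-miss predictions."""
    t = np.arange(num_frames, dtype=np.float32)
    target = np.zeros(num_frames, dtype=np.float32)
    for b in boundary_frames:
        target = np.maximum(target, np.exp(-((t - b) ** 2) / (2 * sigma ** 2)))
    return target

# soft_boundary_targets(20, [5, 14]) peaks at frames 5 and 14 and decays smoothly around them
```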
Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. Large temporal scale variation of actions is one of the most primary difficulties in TAD. Naturally, multi-scale features have potential in localizing actions of diverse lengths as widely used in object detection. Nevertheless, unlike objects in images, actions have more ambiguity in their boundaries. That is, small neighboring objects are not considered as a large one while short adjoining actions can be misunderstood as a long one. In the coarse-to-fine feature pyramid via pooling, these vague action boundaries can fade out, which we call 'vanishing boundary problem'. To this end, we propose Boundary-Recovering Network (BRN) to address the vanishing boundary problem. BRN constructs scale-time features by introducing a new axis called scale dimension by interpolating multi-scale features to the same temporal length. On top of scale-time features, scale-time blocks learn to exchange features across scale levels, which can effectively settle down the issue. Our extensive experiments demonstrate that our model outperforms the state-of-the-art on the two challenging benchmarks, ActivityNet-v1.3 and THUMOS14, with remarkably reduced degree of the vanishing boundary problem.
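Constructing scale-time features by interpolating every pyramid level to a common temporal length and stacking a new scale axis can be sketched as below; per-level shapes of (B, C, T_l) and alignment to the finest level are assumptions:

```python
import torch
import torch.nn.functional as F

def build_scale_time_features(pyramid, target_len=None):
    """Scale-time feature sketch: interpolate every pyramid level to a common
    temporal length and stack the levels along a new scale axis, so later
    blocks can exchange information across scales at each time step."""
    target_len = target_len or pyramid[0].shape[-1]        # align to the finest level
    aligned = [F.interpolate(f, size=target_len, mode="linear", align_corners=False)
               for f in pyramid]
    return torch.stack(aligned, dim=2)                     # (B, C, S, T) scale-time tensor
```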
No abstract available
Temporal action detection aims to recognize the action category and determine each action instance's starting and ending time in untrimmed videos. The mixed method has demonstrated notable performance by integrating both anchor-based and anchor-free approaches. However, while it leverages the strengths of each method, it also retains their respective limitations. For instance, the anchor-based approach depends on manually crafted anchors tailored to specific datasets, while the anchor-free approach predicts potential action instances at each temporal position, resulting in a significant number of false positives in category prediction. The inclusion of these limitations undermines the potential benefits of the mixed method. In this paper, we propose a novel Boundary Discretization and Reliable Classification Network (BDRC-Net) that addresses the issues above by introducing boundary discretization and reliable classification modules. Specifically, the boundary discretization module (BDM) elegantly merges anchor-based and anchor-free approaches in the form of boundary discretization, eliminating the need for the traditional handcrafted anchor design. Furthermore, the reliable classification module (RCM) predicts reliable global action categories to reduce false positives. Extensive experiments conducted on different benchmarks demonstrate that our proposed method achieves competitive detection performance.
Temporal action detection is a challenging task in video understanding, due to the complex background and rich action content that hinder high-quality temporal proposal generation in untrimmed videos. Capsule networks can avoid some limitations of convolutional neural networks, such as the invariance caused by pooling, and can better model the temporal relations needed for temporal action detection. However, because of their extremely expensive computation, capsule networks are difficult to apply to temporal action detection. To address this issue, this paper proposes a novel U-shaped capsule network framework with a k-Nearest Neighbor (k-NN) mechanism of 3D convolutional dynamic routing, which we name U-BlockConvCaps. Furthermore, we build a Capsules Boundary Network (CapsBoundNet) based on U-BlockConvCaps for dense temporal action proposal generation. Specifically, the first module is a 1D convolutional layer that fuses the two-stream RGB and optical flow video features. The sampling module further processes the fused features to generate the 2D start-end action proposal feature maps. Then, the multi-scale U-Block convolutional capsule module with 3D convolutional dynamic routing is used to process the proposal feature map. Finally, the feature maps generated from the CapsBoundNet are used to predict starting, ending, action classification, and action regression score maps, which help to capture the boundary and intersection-over-union features. Our work innovatively improves the dynamic routing algorithm of capsule networks and extends the use of capsule networks to the temporal action detection task for the first time in the literature. Experimental results on the THUMOS14 benchmark show that the performance of CapsBoundNet clearly surpasses state-of-the-art methods, e.g., mAP at tIoU = 0.3, 0.4, and 0.5 on THUMOS14 is improved from 63.6% to 70.0%, 57.8% to 63.1%, and 51.3% to 52.9%, respectively. We also obtain competitive results on the ActivityNet1.3 action detection dataset.
Accurately detecting the start and end boundaries of movements and identifying them in video recognition has always been challenging, especially for dance movement detection due to its high complexity. This paper proposes a deep learning framework for effective feature extraction and movement evaluation. First, we construct a dataset of five modern dance movements and introduce expert knowledge to build the keyframe annotations. Second, we utilize a dual-stream network to capture the features of movements in dance videos. We then annotate the start, key, and end frames of each dance movement to strengthen supervision. Finally, the movement identification module combines the extracted action features with the keyframe features. The experimental results achieve a maximum recall of 89% in the boundary segmentation of dance videos and a maximum accuracy of 65% in dance movement identification, showing that the framework can realize highly accurate dance movement recognition and boundary detection, which provides strong support for artistic creation, teaching and performance in the field of dance.
No abstract available
No abstract available
The task of temporal action detection aims to locate and classify action segments in untrimmed videos. Most existing works consist of two components: snippet-level boundary segmentation and anchor-level action evaluation. These two components, however, are typically designed independently of each other, so detection accuracy is undermined by vague boundaries and complex video content. To tackle this problem, we design two complementary modules. One module, termed the Anchor Aware Module (AAM), uses temporally and semantically related anchors to enhance snippet features. The other, named the Boundary Aware Module (BAM), endows anchor features with a structured representation using intermediate supervision. Moreover, a ConvLSTM is applied in BAM to establish temporal relations over the structured representation. These two modules are integrated as the Boundary-Anchor Complementary Network (BACNet), which achieves state-of-the-art performance on both the THUMOS-14 and ActivityNet-1.3 datasets.
Fine-grained temporal action detection aims at predicting the categories and locating the boundaries of fine-grained action instances in long, untrimmed videos. The fine-grained classification of action instances brings new challenges to temporal action detection, which changes the distribution of action instances of different durations and increases the proportion of short action instances. However, the existing anchor-free detection methods cannot fully utilize global information and local information. Therefore, this paper proposes an anchor-free temporal action detection method with global feature enhancement and local boundary adjustment. Based on the feature pyramid, the attention mechanism of transformer is used to model long-range temporal dependencies between features at different locations of the same level and introduce global information from the upper-level of the feature pyramid to generate coarse predictions. To obtain local details, the interaction with the low-level feature is used to adjust the boundaries of coarse predictions. Experiments on FineAction demonstrate the effectiveness of this method.
Temporal action proposal generation is a fundamental yet challenging task that locates temporal actions in untrimmed videos. Although current proposal generation methods can produce precise action boundaries, few consider the relations between proposals. In this paper, we propose a unified framework, the Boundary Graph Convolutional Network (BGCN), which generates temporal boundary proposals with a graph convolutional network built on the boundary proposals' features. BGCN draws inspiration from boundary-based methods and applies edge graph convolution to the boundary proposals' features. First, a base layer fuses the two-stream video features into two branches of base features. The two branches then enter the same Proposal Features Graph Convolutional Network (PFGCN) structure: an Action PFGCN that extracts the action classification score and a Boundary PFGCN that extracts the starting and ending scores. Within the PFGCN, proposal features are first densely sampled from the video features. We then construct a proposal feature graph in which each proposal feature is a node, the relations between proposal features are edges, and edge convolution is used for graph convolution. Afterwards, the relations are mapped into a 2D score map. Experiments on the popular THUMOS14 benchmark demonstrate the superiority of BGCN over state-of-the-art proposal generators (e.g., G-TAD, TAL-Net, and BMN) at all tIoU thresholds from 0.3 to 0.7 (44.8% versus 42.8% at tIoU 0.5). On ActivityNet1.3, BGCN also obtains better results. Moreover, BGCN is highly efficient for action detection, with a model size of less than 2 MB and fast inference time. Highlights: a GCN built on boundary generation that densely produces action proposals; an efficient and novel BGCN model with a strong capability to learn proposal features; a small model size and fast inference time for temporal action proposal generation.
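As a hedged illustration of the edge-convolution idea described above (nothing below is taken from the BGCN code; the k-nearest-neighbour graph and MLP are assumptions), a proposal feature graph with an EdgeConv-style update might look like this:

```python
import torch
import torch.nn as nn

class ProposalEdgeConv(nn.Module):
    """Illustrative sketch (not the authors' code): treat each candidate proposal
    feature as a graph node, connect every node to its k most similar neighbours,
    and update nodes with an EdgeConv-style MLP over (node, neighbour - node)."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(inplace=True))

    def forward(self, props: torch.Tensor) -> torch.Tensor:
        # props: (N, C) features of N sampled proposals from one video
        sim = props @ props.t()                            # (N, N) similarity
        idx = sim.topk(self.k + 1, dim=1).indices[:, 1:]   # drop self-edges
        neighbours = props[idx]                            # (N, k, C)
        center = props.unsqueeze(1).expand_as(neighbours)
        edge_feat = torch.cat([center, neighbours - center], dim=-1)  # (N, k, 2C)
        updated = self.mlp(edge_feat).max(dim=1).values               # (N, C)
        return updated
```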
No abstract available
No abstract available
Temporal action localization (TAL) is crucial in video analysis, yet presents notable challenges. This process focuses on the precise identification and categorization of action instances within lengthy, raw videos. A key difficulty in TAL lies in determining the exact start and end points of actions, owing to the often unclear boundaries of these actions in real-world footage. Existing methods tend to take insufficient account of changes in action boundary features. To tackle these issues, we propose a boundary awareness network (BAN) for TAL. Specifically, the BAN mainly consists of a feature encoding network, coarse pyramidal detection to obtain preliminary proposals and action categories, and fine-grained detection with a Gaussian boundary module (GBM) to get more valuable boundary information. The GBM contains a novel Gaussian boundary pooling, which serves to aggregate the relevant features of the action boundaries and to capture discriminative boundary and actionness features. Furthermore, we introduce a novel approach named Boundary Differentiated Learning (BDL) to ensure our model’s capability in accurately identifying action boundaries across diverse proposals. Comprehensive experiments on both the THUMOS14 and ActivityNet v1.3 datasets, where our BAN model achieved an increase in mean Average Precision (mAP) by 1.6% and 0.2%, respectively, over existing state-of-the-art methods, illustrate that our approach not only improves upon the current state of the art but also achieves outstanding performance.
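The abstract only names "Gaussian boundary pooling" without details. A plausible minimal reading, sketched below purely as an assumption, is to aggregate snippet features with Gaussian weights centred on a predicted boundary location:

```python
import torch

def gaussian_boundary_pool(feats: torch.Tensor, centers: torch.Tensor,
                           sigma: float = 2.0) -> torch.Tensor:
    """Hypothetical sketch of Gaussian-weighted pooling around boundaries.
    feats:   (B, C, T) snippet features
    centers: (B,) predicted boundary positions (snippet indices, float)
    returns: (B, C) boundary descriptors aggregated with Gaussian weights."""
    B, C, T = feats.shape
    t = torch.arange(T, device=feats.device, dtype=feats.dtype)            # (T,)
    # Gaussian weight of every snippet w.r.t. its video's boundary centre
    w = torch.exp(-0.5 * ((t.unsqueeze(0) - centers.unsqueeze(1)) / sigma) ** 2)
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)                     # (B, T)
    return torch.einsum("bct,bt->bc", feats, w)
```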
Temporal action detection (TAD) is a challenging task in the field of video understanding. It requires determining the semantic labels and precise boundaries of each action instance in an untrimmed video. Over the years, a variety of networks, including convolutional, graph, and transformer architectures, have been effectively applied to TAD. Most methods can identify the action category well; however, the accuracy of determining action boundaries is still insufficient. Because an action contains several consecutive frames of similar images, we propose picking out the key frames in the video sequence and enhancing the TAD representation by extracting additional key-frame features. We propose KeyMamba, a state-space-model-based learnable network for TAD tasks. The proposed model applies a bidirectional Mamba block to capture global features efficiently. We also add a temporal deformable attention module to extract key-frame features from video clips. These features capture motion changes, and the key-frame features complement the global features, allowing video action boundaries to be identified more accurately. In addition, to obtain higher-quality tokens in the spatial dimension, we add an attention mask before the bidirectional Mamba block encoder. Finally, we also apply masking operations during the forward and backward scanning processes within the bidirectional Mamba block to mitigate the impact of duplicate tokens. Our experiments achieve outstanding performance on the THUMOS14 and ActivityNet-1.3 datasets, reaching an average mAP of 70.4 on THUMOS14 and 38.44 on ActivityNet-1.3.
End-to-end Temporal Action Detection (TAD) has achieved remarkable progress in recent years, driven by innovations in model architectures and the emergence of Video Foundation Models (VFMs). However, existing TAD methods that perform full fine-tuning of pretrained video models often incur substantial computational costs, which become particularly pronounced when processing long video sequences. Moreover, the need for precise temporal boundary annotations makes data labeling extremely expensive. In low-resource settings where annotated samples are scarce, direct fine-tuning tends to cause overfitting. To address these challenges, we introduce Dynamic Low-Rank Adapter (DyLoRA), a lightweight fine-tuning framework tailored specifically for the TAD task. Built upon the Low-Rank Adaptation (LoRA) architecture, DyLoRA adapts only the key layers of the pretrained model via low-rank decomposition, reducing the number of trainable parameters to less than 5% of full fine-tuning methods. This significantly lowers memory consumption and mitigates overfitting in low-resource settings. Notably, DyLoRA enhances the temporal modeling capability of pretrained models by optimizing temporal dimension weights, thereby alleviating the representation misalignment of temporal features. Experimental results demonstrate that DyLoRA-TAD achieves impressive performance, with 73.9% mAP on THUMOS14, 39.52% on ActivityNet-1.3, and 28.2% on Charades, substantially surpassing the best traditional feature-based methods.
No abstract available
Temporal action detection (TAD), which locates and recognizes action segments, remains a challenging task in video understanding due to variable segment lengths and ambiguous boundaries. Existing methods treat neighboring contexts of an action segment indiscriminately, leading to imprecise boundary predictions. We introduce a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time. Our model features a pyramid adaptive context aggregation (ACA) architecture, capturing long context and improving action discriminability. Each ACA level consists of two novel modules. The context attention module (CAM) identifies salient contextual information, encourages context diversity, and preserves context integrity through a context gating block (CGB). The long context module (LCM) makes use of a mixture of large- and small-kernel convolutions to adaptively gather long-range context and fine-grained local features. Additionally, by varying the length of these large kernels across the ACA pyramid, our model provides lightweight yet effective context aggregation and action discrimination. We conducted extensive experiments and compared our model with a number of advanced TAD methods on six challenging TAD benchmarks: MultiThumos, Charades, FineAction, EPIC-Kitchens 100, Thumos14, and HACS, demonstrating superior accuracy at reduced inference cost.
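As a sketch of the large-/small-kernel mixture idea (kernel sizes, depthwise convolutions, and the sigmoid gate are assumptions, not ContextDet's actual modules):

```python
import torch
import torch.nn as nn

class LongContextMixer(nn.Module):
    """Illustrative mixture of a large-kernel and a small-kernel 1D convolution,
    blended by a learned per-position gate (a sketch, not the ContextDet code)."""

    def __init__(self, dim: int, large_k: int = 31, small_k: int = 3):
        super().__init__()
        self.large = nn.Conv1d(dim, dim, large_k, padding=large_k // 2, groups=dim)
        self.small = nn.Conv1d(dim, dim, small_k, padding=small_k // 2, groups=dim)
        self.gate = nn.Sequential(nn.Conv1d(dim, dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T); the gate decides, per channel and position, how much
        # long-range context vs. fine-grained local detail to keep
        g = self.gate(x)
        return g * self.large(x) + (1.0 - g) * self.small(x)
```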
Temporal Action Detection (TAD), the task of localizing and classifying actions in untrimmed video, remains challenging due to action overlaps and variable action durations. Recent findings suggest that TAD performance depends more on the structural design of transformers than on the self-attention mechanism itself. Building on this insight, we propose a refined feature extraction process through lightweight yet effective operations. First, we employ a local branch that uses parallel convolutions with varying window sizes to capture both fine-grained and coarse-grained temporal features; this branch incorporates a gating mechanism to select the most relevant features. Second, we introduce a context branch that uses boundary frames as key-value pairs to analyze their relationship with the central frame through cross-attention. The proposed method captures temporal dependencies and improves contextual understanding. Evaluations of the gating mechanism and context branch on challenging datasets (THUMOS14 and EPIC-KITCHENS 100) show a consistent improvement over the baseline and existing methods.
Online detection of action start is a significant and challenging task that requires prompt identification of action start positions and corresponding categories within streaming videos. This task presents challenges due to data imbalance, similarity in boundary content, and real‐time detection requirements. Here, a novel Time‐Attentive Fusion Network is introduced to address the requirements of improved action detection accuracy and operational efficiency. The time‐attentive fusion module is proposed, which consists of long‐term memory attention and the fusion feature learning mechanism, to improve spatial‐temporal feature learning. The temporal memory attention mechanism captures more effective temporal dependencies by employing weighted linear attention. The fusion feature learning mechanism facilitates the incorporation of current moment action information with historical data, thus enhancing the representation. The proposed method exhibits linear complexity and parallelism, enabling rapid training and inference speed. This method is evaluated on two challenging datasets: THUMOS’14 and ActivityNet v1.3. The experimental results demonstrate that the proposed method significantly outperforms existing state‐of‐the‐art methods in terms of both detection accuracy and inference speed.
Boundary localization is a challenging problem in Temporal Action Detection (TAD), in which there are two main issues. First, the submergence of movement feature, i.e. the movement information in a snippet is covered by the scene information. Second, the scale of action, that is, the proportion of action segments in the entire video, is considerably variable. In this work, we first design a Movement Enhance Module (MEM) to highlight movement feature for better action location, and then, we propose a Scale Feature Pyramid Network (SFPN) to detect multi-scale actions in videos. For Movement Enhance Module, firstly, Movement Feature Extractor (MFE) is designed to get the movement feature. Secondly, we propose a Multi-Relation Enhance Module (MREM) to grasp valuable information correlation both locally and temporally. For Scale Feature Pyramid Network, we design a U-Shape Module to model different scale actions, moreover, we design the training and inference strategy of different scales, ensuring that each pyramid layer is only responsible for actions at a specific scale. These two innovations are integrated as the Movement Enhance Network (MENet), and extensive experiments conducted on two challenging benchmarks demonstrate its effectiveness. MENet outperforms other representative TAD methods on ActivityNet-1.3 and THUMOS-14.
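A minimal way to "highlight movement features" when scene appearance dominates is a temporal difference of adjacent snippet features; the sketch below is an interpretation of that idea, not MENet's actual MFE:

```python
import torch
import torch.nn as nn

class MovementFeatureSketch(nn.Module):
    """Sketch (an assumption, not MENet's exact MFE): expose movement by taking
    differences between adjacent snippet features and projecting them back."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) snippet features dominated by scene appearance
        diff = x[:, :, 1:] - x[:, :, :-1]           # snippet-to-snippet change
        diff = nn.functional.pad(diff, (1, 0))      # keep temporal length T
        return x + self.proj(diff)                  # scene feature + movement cue
```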
Cricket Bowl Release Detection aims to segment specific portions of bowl release actions occurring in multiple videos, with a focus on detecting the entire time window of this action. Unlike traditional detection tasks that identify action categories at a specific moment, this task involves identifying events that typically span around 100 frames and require recognizing all instances of the bowl release action in the video. Strictly speaking, this task falls under a branch of temporal action detection. With the advancement of deep neural networks, recent works have proposed deep learning-based approaches to address this task. However, due to the challenge of unclear action boundaries in videos, many existing methods perform poorly on the DeepSportradar Cricket Bowl Release Dataset. To more accurately identify specific portions of the bowl release action in videos, we adopt a one-stage architecture based on Relative Boundary Modeling. Specifically, our method consists of three stages. In the first stage, we use the Inflated 3D ConvNet (I3D) model to extract spatio-temporal features from the input videos. In the second stage, we utilize Temporal Action Detection with Relative Boundary Modeling (TriDet) to model the boundaries of the bowl release action's specific portions based on the relative relationships between different time moments, thereby predicting the action's time window. Lastly, as the target events typically span around 100 frames and the predicted time windows may exhibit overlapping regions based on confidence scores, we implement a post-processing step to merge and filter these outputs, resulting in the final submission results. We conducted extensive experiments to demonstrate that our proposed method achieves superior performance. Additionally, we evaluated the training techniques of existing approaches. Our proposed method achieves a PQ score of 0.519, an SQ score of 0.822, and an RQ score of 0.632 on the challenge set of the DeepSportradar Cricket Bowl Release Dataset. Through this approach, our team, USTC_IAT_United, won the third place in the first phase of the DeepSportradar Cricket Bowl Release Challenge.
Temporal Action Detection (TAD) aims to identify action boundaries and their corresponding categories in untrimmed videos, playing a crucial role in long-video understanding. Prior works often struggle to balance the trade-off between capturing long-range dependencies and ensuring computational efficiency. Recently, the state space model Mamba has exhibited impressive capabilities and efficiency in long-term sequence modeling. However, current methods based on Mamba generally lack a unified framework to simultaneously address the redundancy of long-duration actions and the boundary sensitivity of short-duration actions—limitations that largely stem from Mamba’s reliance on limited state representations and its unidirectional modeling. To tackle the aforementioned challenges, we propose DilatedTAD, a novel TAD framework with an expanded receptive field. DilatedTAD leverages the Inter-Parallel DIM component (InterDIM) to integrate multi-scale temporal information, enabling a better trade-off between short-duration and long-duration action detection. InterDIM is built upon our proposed Dilated Mamba (DIM), where multiple DIM branches with different dilation rates are designed to focus on actions of varying durations. Specifically, DIM introduces a novel use of dilation to skip redundant temporal information, thereby enhancing the model’s focus on crucial boundary features. Additionally, a bidirectional modeling design is adopted in DIM to compensate for the lack of future temporal context in the original Mamba architecture. Extensive experiments show that DilatedTAD outperforms state-of-the-art methods on multiple datasets, achieving mAPs of 74.9% (THUMOS14), 42.90% (ActivityNet 1.3), 45.0% (HACS), and 26.3% and 24.3% (EPIC-Kitchens 100). Our code will be publicly available.
Temporal action proposal generation plays a vital role in the analysis of untrimmed videos and has garnered growing interest from researchers. Nevertheless, the presence of long-term temporal dependencies and the large variation in action durations within untrimmed videos pose significant challenges for accurately localizing action boundaries. To overcome the aforementioned issues, we design a novel Transformer-based Temporal Feature Pyramid Network (TTFPN) tailored for generating action proposals. Specifically, we introduce a local transformer to capture long-term temporal information while reducing computational complexity through the substitution of conventional self-attention with a localized variant. Subsequently, a temporal feature pyramid is built to produce multi-scale representations, enabling the model to effectively handle action instances of varying durations. Based on this temporal feature pyramid, we employ a convolutional network-based predictor to generate action proposals in an anchor-free manner. We evaluate TTFPN on THUMOS14, a standard benchmark for temporal action detection, to validate its effectiveness. The results show that TTFPN achieves competitive performance and significantly outperforms previous methods.
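The "localized variant" of self-attention can be illustrated by restricting attention to non-overlapping temporal windows; the window size and single-layer layout below are assumptions rather than TTFPN's exact design:

```python
import torch
import torch.nn as nn

class WindowedSelfAttention(nn.Module):
    """Sketch of local self-attention over non-overlapping temporal windows
    (window size and head count are illustrative assumptions)."""

    def __init__(self, dim: int, window: int = 16, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C); T is assumed to be padded to a multiple of `window`
        B, T, C = x.shape
        w = self.window
        x = x.reshape(B * T // w, w, C)    # each window attends only to itself
        out, _ = self.attn(x, x, x)
        return out.reshape(B, T, C)
```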
As the cornerstone of human-behavior analysis in video understanding, temporal action proposal generation aims to predict the starting and ending time of human action instances in untrimmed videos. Although large achievements in temporal action proposal generation have been achieved, most previous studies ignore the variability of action frequency in raw videos, leading to unsatisfying performances on high-action-frequency videos. In fact, there exists two main issues which should be well addressed: data imbalance between high and low action-frequency videos, and inferior detection of short actions in high-action-frequency videos. To address the above issues, we propose an effective framework by adapting to the variability of action frequency, namely Action Frequency Adaptive Network (AFAN), which can be flexibly built upon any temporal action proposal generation method. AFAN consists of two modules: Learning From Experts (LFE) and Fine-Grained Processing (FGP). The LFE first trains a series of action proposal generators on different subsets of imbalanced data as experts and then teaches a unified student model via knowledge distillation. To better detect short actions, FGP first finds out high-action-frequency videos and then performs fine-grained detection. Extensive experimental results on four benchmark datasets (ActivityNet-1.3, HACS, THUMOS14 and FineAction) demonstrate the effectiveness and generalizability of the proposed AFAN, especially for high-action-frequency videos.
Temporal action proposal generation (TAPG) aims to locate action instances in untrimmed videos for video analysis tasks. In this paper, we propose a novel approach called Internal Location Assistance Net (ILAN) that takes advantage of internal action points instead of utilizing only the start and end points themselves. Specifically, besides predicting the action's start and end positions, we also predict extra internal points, which could be the one-eighth, one-fourth, or center points, etc. The pairs of left and right internal positions of the action are then matched to generate center-region proposals, and the predicted center region is constrained to lie within the predicted overall action region. Both the confidence of the overall action region and that of the center region are combined to obtain the final action proposal. Besides, we incorporate a window transformer to enhance feature extraction for capturing more precise action boundaries. Extensive experiments are conducted on two popular benchmark datasets: THUMOS14 and ActivityNet-v1.3. The experimental results demonstrate that the proposed method outperforms state-of-the-art methods.
Temporal action proposal generation is a method for extracting temporal action instances or proposals from untrimmed videos. Existing methods often struggle to segment contiguous action proposals, which are a group of action boundaries with small temporal gaps. To address this limitation, we propose incorporating an attention mechanism to weigh the importance of each proposal within a contiguous group. This mechanism leverages the gap displacement between proposals to calculate attention scores, enabling a more accurate localization of action boundaries. We evaluate our method against a state-of-the-art boundary-based baseline on ActivityNet v1.3 and Thumos 2014 datasets. The experimental results demonstrate that our approach significantly improves the performance of short-duration and contiguous action proposals, achieving an average recall of 78.22%.
Temporal action proposal generation (TAPG) is a fundamental and challenging task in media interpretation and video understanding, especially in temporal action detection. Most previous works focus on capturing the local temporal context and can locate simple action instances with clean frames and clear boundaries well. However, they generally fail in complicated scenarios where the actions of interest involve irrelevant frames and background clutter, and the local temporal context becomes less effective. To deal with these problems, we present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG. Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer, which improves its ability to capture long-range dependencies and learn robust features for noisy action instances. Moreover, an adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and differences between adjacent features. The features from the two modules carry rich semantic information about the video and are fused for effective sequential proposal generation. Extensive experiments are conducted on two challenging datasets, THUMOS14 and ActivityNet1.3, and the results demonstrate that our method outperforms state-of-the-art TAPG methods. Our code will be released soon.
No abstract available
Transformer networks are effective at modeling long-range contextual information and have recently demonstrated exemplary performance in the natural language processing domain. Conventionally, the temporal action proposal generation (TAPG) task is divided into two main sub-tasks: boundary prediction and proposal confidence prediction, which rely on the frame-level dependencies and proposal-level relationships separately. To capture the dependencies at different levels of granularity, this paper intuitively presents a unified temporal action proposal generation framework with original Transformers, called TAPG Transformer, which consists of a Boundary Transformer and a Proposal Transformer. Specifically, the Boundary Transformer captures long-term temporal dependencies to predict precise boundary information and the Proposal Transformer learns the rich inter-proposal relationships for reliable confidence evaluation. Extensive experiments are conducted on two popular benchmarks: ActivityNet-1.3 and THUMOS14, and the results demonstrate that TAPG Transformer outperforms state-of-the-art methods. Equipped with the existing action classifier, our method achieves remarkable performance on the temporal action localization task. Codes and models will be available.
By conditioning on unit-level predictions, anchor-free models for action proposal generation have displayed impressive capabilities, such as having a lightweight architecture. However, task performance depends significantly on the quality of data used in training, and most effective models have relied on human-annotated data. Semi-supervised learning, i.e., jointly training deep neural networks with a labeled dataset as well as an unlabeled dataset, has made significant progress recently. Existing works have either primarily focused on classification tasks, which may require less annotation effort, or considered anchor-based detection models. Inspired by recent advances in semi-supervised methods on anchor-free object detectors, we propose a teacher-student framework for a two-stage action detection pipeline, named Temporal Teacher with Masked Transformers (TTMT), to generate high-quality action proposals based on an anchor-free transformer model. Leveraging consistency learning as one self-training technique, the model jointly trains an anchor-free student model and a gradually progressing teacher counterpart in a mutually beneficial manner. As the core model, we design a Transformer-based anchor-free model to improve effectiveness for temporal evaluation. We integrate bi-directional masks and devise encoder-only Masked Transformers for sequences. Jointly training on boundary locations and various local snippet-based features, our model predicts via the proposed scoring function for generating proposal candidates. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our model for temporal proposal generation task.
Temporal Action Proposal Generation (TAPG) is a challenging task that aims to accurately generate temporal proposals likely to contain human actions in an untrimmed video. Inspired by the successful pre-training strategy applied by transformer-based image object detection methods, we propose a Self-Supervised Pre-Training transformer (SSPT) method for the TAPG task to train an action locator without any labels. As far as we know, this is the first work to explore self-supervised pre-training of transformers for TAPG. Specifically, the pre-training strategy leverages the transformer to capture contextual semantic information and global temporal dependencies of actions across the entire video, which better serves the TAPG task. In the pre-training process, we adopt Random Query Segments, which randomly crops multiple segments from the original video. We treat the segments as pseudo-labels and input them as queries to the transformer decoder. The pretext task we design is to locate the start and end times of the pseudo-labels in the original video. Extensive experiments on THUMOS14 demonstrate the effectiveness of SSPT; the results show that it improves the baseline and leads to higher recall for the temporal action proposal generation task.
Temporal action proposal generation (TAPG) is a challenging task that aims to locate action instances in untrimmed videos with temporal boundaries. To evaluate the confidence of proposals, the existing works typically predict action score of proposals that are supervised by the temporal Intersection-over-Union (tIoU) between proposal and the ground-truth. In this paper, we innovatively propose a general auxiliary Background Constraint idea to further suppress low-quality proposals, by utilizing the background prediction score to restrict the confidence of proposals. In this way, the Background Constraint concept can be easily plug-and-played into existing TAPG methods (BMN, GTAD). From this perspective, we propose the Background Constraint Network (BCNet) to further take advantage of the rich information of action and background. Specifically, we introduce an Action-Background Interaction module for reliable confidence evaluation, which models the inconsistency between action and background by attention mechanisms at the frame and clip levels. Extensive experiments are conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS14. The results demonstrate that our method outperforms state-of-the-art methods. Equipped with the existing action classifier, our method also achieves remarkable performance on the temporal action localization task.
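The plug-and-play Background Constraint idea can be summarized as down-weighting a proposal's confidence by the background probability predicted inside it; the multiplicative fusion below is an illustrative assumption, not necessarily BCNet's exact formula:

```python
import torch

def background_constrained_confidence(action_score: torch.Tensor,
                                      background_score: torch.Tensor) -> torch.Tensor:
    """Sketch of the plug-and-play background-constraint idea: suppress the
    confidence of proposals whose interior looks like background.
    action_score:     (N,) tIoU-supervised confidence of each proposal
    background_score: (N,) mean predicted background probability inside it
    The multiplicative fusion below is an illustrative assumption."""
    return action_score * (1.0 - background_score.clamp(0.0, 1.0))
```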
Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos, which is challenging yet plays an important role in many video analysis and understanding tasks. Despite great achievements in TAPG, most existing works ignore how humans perceive the interaction between agents and the surrounding environment, applying a deep learning model as a black box to untrimmed videos to extract visual representations. It is therefore beneficial, and potentially improves TAPG performance, to capture these interactions between agents and the environment. In this paper, we propose a novel framework named Agent-Aware Boundary Network (ABN), which consists of two sub-networks: (1) an Agent-Aware Representation Network to capture both agent-agent and agent-environment relationships in the video representation; and (2) a Boundary Generation Network to estimate the confidence scores of temporal intervals. In the Agent-Aware Representation Network, interactions between agents are expressed through a local pathway, which operates at a local level to focus on the motions of agents, whereas the overall perception of the surroundings is expressed through a global pathway, which operates at a global level to perceive the effects of agent-environment interactions. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks (i.e., C3D, SlowFast, and Two-Stream) show that our proposed ABN robustly outperforms state-of-the-art methods on TAPG regardless of the employed backbone network. We further examine proposal quality by feeding the proposals generated by our method into temporal action detection (TAD) frameworks and evaluating their detection performance.
It has been found that temporal action proposal generation, which aims to discover the temporal action instances within the range of the start and end frames in untrimmed videos, can largely benefit from proper temporal and semantic context exploitation. The latest efforts were dedicated to considering the temporal context and similarity-based semantic contexts through self-attention modules. However, they still suffer from cluttered background information and limited contextual feature learning. In this paper, we propose a novel Pyramid Region-based Slot Attention (PRSlot) module to address these issues. Instead of using similarity computation, our PRSlot module directly learns the local relations in an encoder-decoder manner and generates an enhanced representation of a local region, called a slot, based on attention over the input features. Specifically, given the input snippet-level features, the PRSlot module takes the target snippet as the query and its surrounding region as the key, and then generates slot representations for each query-key pair by aggregating the local snippet context with a parallel pyramid strategy. Based on PRSlot modules, we present a novel Pyramid Region-based Slot Attention Network, termed PRSA-Net, to learn a unified visual representation with rich temporal and semantic context for better proposal generation. Extensive experiments are conducted on the two widely adopted THUMOS14 and ActivityNet-1.3 benchmarks. Our PRSA-Net outperforms other state-of-the-art methods. In particular, we improve AR@100 from the previous best 50.67% to 56.12% for proposal generation and raise the mAP at 0.5 tIoU from 51.9% to 58.7% for action detection on THUMOS14. Code is available at https://github.com/handhand123/PRSA-Net
Temporal action proposal generation aims to localize temporal segments of human activities in videos. Current boundary-based proposal generation methods can generate proposals with precise boundary but often suffer from the inferior quality of confidence scores used for proposal retrieving. In this article, we propose an effective and end-to-end action proposal generation method, named ProposalVLAD, with Proposal-Intra Exploring Network (PVPI-Net). We first propose a ProposalVLAD module to dynamically generate global features of the entire video, then we combine the global features and proposal local features to generate the final feature representations for all candidate proposals. Then, we design a novel Proposal-Intra Loss function (PI-Loss) to generate more reliable proposal confidence scores. Extensive experiments on large-scale and challenging datasets demonstrate the effectiveness of our proposed method. Experimental results show that our PVPI-Net achieves significant improvements on two benchmark datasets (i.e., THUMOS’14 and ActivityNet-1.3) and sets new records for temporal action detection task.
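The abstract describes ProposalVLAD as dynamically aggregating global video features; a NetVLAD-style soft-assignment aggregation is one standard way to realize this, sketched below with an assumed cluster count (a stand-in, not the released module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD1D(nn.Module):
    """NetVLAD-style aggregation over snippet features (an illustrative stand-in
    for the ProposalVLAD module; cluster count and layout are assumptions)."""

    def __init__(self, dim: int, clusters: int = 8):
        super().__init__()
        self.assign = nn.Conv1d(dim, clusters, kernel_size=1)
        self.centroids = nn.Parameter(torch.randn(clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) snippet features -> (B, K*C) global video descriptor
        B, C, T = x.shape
        a = F.softmax(self.assign(x), dim=1)                           # (B, K, T)
        residual = x.unsqueeze(1) - self.centroids.view(1, -1, C, 1)   # (B, K, C, T)
        vlad = (a.unsqueeze(2) * residual).sum(dim=-1)                 # (B, K, C)
        vlad = F.normalize(vlad, dim=-1).flatten(1)                    # intra-norm + flatten
        return F.normalize(vlad, dim=-1)
```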
No abstract available
Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most existing approaches are unable to follow the human cognitive process of understanding video context, as they lack an attention mechanism to express the concept of an action, the agent who performs it, or the interaction between the agent and the environment. Based on the definition that an action occurs when a human, known as an agent, interacts with the environment and performs an action that affects it, we propose a contextual Agent-Environment Network (AEN). Our proposed contextual AEN involves (i) an agent pathway, operating at a local level to identify which humans/agents are acting, and (ii) an environment pathway, operating at a global level to describe how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method robustly outperforms state-of-the-art methods regardless of the employed backbone network.
The goal of Temporal Action Localization (TAL) is to find the categories and temporal boundaries of actions in an untrimmed video. Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries. More importantly, few works consider the background frames that are similar to action frames in pixels but dissimilar in semantics, which also leads to inaccurate temporal boundaries. To address the challenge above, we propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning. Specifically, we define the above background frames as hard negative samples. Contrastive learning with hard negative mining is introduced to improve the discrimination of BAPG. BAPG is independent of the existing TAL network architecture, so it can be applied plug-and-play to mainstream TAL models without training. Extensive experimental results on THUMOS14 and ActivityNet-1.3 demonstrate that BAPG can significantly improve the performance of TAL.
Temporal action proposal generation aims to generate temporal video segments containing human actions in untrimmed videos, and is typically a preliminary step for video understanding tasks such as action localization and temporal description grounding. Fully-supervised solutions, though proven to be effective, suffer from heavy data annotation overhead. To address this problem, this paper focuses on a rarely investigated yet practical problem: semi-supervised learning for temporal action proposal generation. Firstly, we propose a Proposal Map oriented Mean-Teacher (PM-MT) model, which can use both labeled and unlabeled data for end-to-end model training. Secondly, a Suppression-and-Re-Generation (SRG) strategy is designed to generate high-quality pseudo labels for unlabeled data, which are then used to finetune the model. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art results on two public benchmark datasets for semi-supervised action proposal generation and outperforms fully-supervised learning methods with only a portion of the labeled data.
No abstract available
Temporal action detection (TAD) aims to detect the semantic labels and boundaries of action instances in untrimmed videos. Current mainstream approaches are multi-step solutions, which fall short in efficiency and flexibility. In this paper, we propose a unified network for TAD, termed Faster-TAD, by re-purposing a Faster-RCNN-like architecture. To tackle the unique difficulties of TAD, we make important improvements over the original framework: we propose a new Context-Adaptive Proposal Module and an innovative Fake-Proposal Generation Block, and we additionally use atomic action features to improve performance. Faster-TAD simplifies the TAD pipeline and achieves remarkable performance on many benchmarks, i.e., ActivityNet-1.3 (40.01% mAP), HACS Segments (38.39% mAP), and SoccerNet-Action Spotting (54.09% mAP). It outperforms existing single-network detectors by a large margin.
No abstract available
No abstract available
Temporal action detection is a fundamental yet challenging video understanding task. The calculation of confidence score for each generated action proposals remains the bottleneck of this task. Given that the continuity of videos is beneficial for self-supervised learning, in this paper we propose Self-attention Assisted Ranking Network (SARNet), which uses a self-attention mechanism to assist the ranking and retrieval of generated proposals. Our method incorporates a discriminative and a generative constraint to train the self-attention weight. Extensive experiments on THUMOS14 demonstrate that our method achieves a considerable improvement of average recall with a small number of proposals, and brings the mAP up to 30% at tIoU threshold 0.7 for the first time.
Temporal action proposal generation aims to generate temporal boundaries containing action instances. In real-time applications such as surveillance cameras, autonomous driving, and traffic monitoring, the online localization and recognition of human activities occurring in short temporal intervals are important areas of research. Existing approaches of temporal action proposal generation consider only the offline and frame-level feature aggregation along the temporal dimension. Those offline methods also generate many redundant irrelevant proposal regions in the frames as temporal boundaries. This leads to higher computational cost along with slow processing speed which is not suitable for online tasks. In this study, we propose a novel spatio-temporal attention network for online action proposal generation as opposed to existing offline proposal generation methods. Our novel proposed approach incorporates the inter-dependency between the spatial and temporal context information of each incoming video clip to generate more relevant online temporal action proposals. First, we propose a windowed spatial attention module to capture the inter-spatial relationship between the features of incoming frames. The windowed spatial network produces more robust clip-level feature representation and efficiently deals with noisy features such as occlusion or background scenes. Second, we introduce a temporal attention module to capture relevant temporal dynamic information mutually to the localized spatial information to model the long inter-frame temporal relationship since most online real life videos are untrimmed in nature. By applying these two attention modules sequentially, the novel proposed spatio-temporal network model is able to generate precise action boundaries at a particular instant of time. In addition, the model generates fewer discriminative temporal action proposals while maintaining a low computational cost and high processing speed suitable for online settings.
Temporal action proposal generation is an important and challenging task, aiming to localize the positions where an action or event may occur in an untrimmed video. In this paper, we propose an efficient and end-to-end framework for generating temporal action proposals, named the Phase-Sensitive Model (PSM), which fully understands all phases of temporal information. In particular, the PSM consists of two modules: Boundary Phase Classification (BPC) and Action Phase Classification (APC). The BPC provides two temporal boundary phase confidence maps from rich local information, while the APC is designed to generate an action phase confidence map from global features. Moreover, we introduce a new boundary probability calculation method to obtain the final score. Our experiments on ActivityNet-1.3 show a significant improvement with remarkable efficiency and generalizability.
Temporal action detection is a practical but challenging task. Current temporal action detection methods have major shortcomings in the accuracy of proposal generation; extracting long-range temporal features and fusing two-stream features for temporal action proposal generation remain challenges to be addressed. In this paper, we propose the Global Two-Stream Network, which innovatively introduces the Non-Local operation to extract global background information from the features of the generated candidate proposals. A two-stream backbone network is used to better exploit the two-stream features and generate temporal action proposal segments with precise boundaries and high confidence.
Temporal action detection is a fundamental yet challenging task in video content analysis. The performance of existing methods remains far from satisfactory, as the mAP drops dramatically at high tIoU thresholds. With the goal of predicting the starting and ending points more precisely, this work introduces the action category label into the temporal proposal generation stage of the training process. Specifically, with the category information, we propose two extra constraints, i.e., an action-based constraint and an action-class-agnostic constraint. The former minimizes the discrepancy within the same action category, while the latter forces the features of samples in the same phase to aggregate. Comprehensive experiments are conducted on the THUMOS'14 benchmark. A remarkable improvement in average recall is attained, especially when the number of proposals is small, and our approach achieves 29.0% mAP at a strict tIoU of 0.7.
No abstract available
Previous works have shown that explicit snippet relationship modeling can be helpful for feature learning on untrimmed action videos. However, snippet relationship learning in these methods is far from optimal, in that they fail to consider valuable temporally coarse-grained features, learnable soft relationship weights, and separate relationship learning in different temporal orders. To address this issue, we propose a novel SGC-Block for improved snippet relationship learning, which enables temporally coarse-to-fine, soft-valued, snippet-wise relationship learning in different temporal directions. The SGC-Block constructs a snippet graph and explicitly models (1) temporal relations (TPR), (2) coarse-grained snippet-wise relations (CSR), (3) fine-grained snippet-wise relations (FSR), and an additional (4) adaptive relations (ADR). In particular, the novel CSR is inspired by the feature pyramid pooling structure to obtain coarse feature representations in the temporal dimension. Experimental results show that our proposed approach outperforms most state-of-the-art methods on the THUMOS14 and ActivityNet-1.3 benchmarks.
No abstract available
Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. Existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-like architecture. To tackle the essential visual differences between time and space, we make three important improvements over the original transformer detection framework (DETR). First, to deal with the slowness prior in videos, we replace the original Transformer encoder with a boundary-attentive module to better capture long-range temporal information. Second, due to ambiguous temporal boundaries and relatively sparse annotations, we present a relaxed matching scheme to relieve the strict criterion of single assignment to each ground truth. Finally, we devise a three-branch head to further improve proposal confidence estimation by explicitly predicting its completeness. Extensive experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net on both temporal action proposal generation and temporal action detection. Moreover, due to its simplicity in design, our framework is more efficient than previous proposal generation methods, without non-maximum suppression post-processing. The code and models are made available at https://github.com/MCG-NJU/RTD-Action.
Temporal action proposal generation aims to estimate temporal intervals of actions in untrimmed videos, which is a challenging yet important task in the video understanding field. The proposals generated by current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval owing to the lack of efficient temporal modeling and effective boundary context utilization. In this paper, we propose Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary as well as progressive boundary refinement. Specifically, we first design a Local-Global Temporal Encoder (LGTE), which adopts the channel grouping strategy to efficiently encode both "local and global" temporal inter-dependencies. Furthermore, both the boundary and internal context of proposals are adopted for frame-level and segment-level boundary regressions, respectively. Temporal Boundary Regressor (TBR) is designed to combine these two regression granularities in an end-to-end fashion, which achieves the precise boundaries and reliable confidence of proposals through progressive refinement. Extensive experiments are conducted on three challenging datasets: HACS, ActivityNet-v1.3, and THUMOS-14, where TCANet can generate proposals with high precision and recall. By combining with the existing action classifier, TCANet can obtain remarkable temporal action detection performance compared with other methods. Not surprisingly, the proposed TCANet won the 1st place in the CVPR 2020 - HACS challenge leaderboard on temporal action localization task.
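The channel-grouping strategy of the LGTE can be pictured as routing part of the channels through a local operation and the rest through a global one; the 50/50 split, depthwise convolution, and self-attention below are assumptions used only for illustration:

```python
import torch
import torch.nn as nn

class LocalGlobalEncoderSketch(nn.Module):
    """Sketch of a channel-grouping local/global encoder: half of the channels
    go through a local depthwise convolution, the other half through global
    self-attention (the 50/50 split and layer choices are assumptions)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % 2 == 0          # assumed even channel count, half divisible by heads
        half = dim // 2
        self.local = nn.Conv1d(half, half, kernel_size=3, padding=1, groups=half)
        self.global_attn = nn.MultiheadAttention(half, heads, batch_first=True)
        self.out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T)
        x_local, x_global = x.chunk(2, dim=1)
        local = self.local(x_local)                     # short-term context
        g = x_global.transpose(1, 2)                    # (B, T, C/2)
        g, _ = self.global_attn(g, g, g)                # long-term context
        return self.out(torch.cat([local, g.transpose(1, 2)], dim=1))
```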
Self-supervised learning presents a remarkable performance to utilize unlabeled data for various video tasks. In this paper, we focus on applying the power of self-supervised methods to improve semi-supervised action proposal generation. Particularly, we design an effective Self-supervised Semi-supervised Temporal Action Proposal (SSTAP) framework. The SSTAP contains two crucial branches, i.e., temporal-aware semi-supervised branch and relation-aware self-supervised branch. The semi-supervised branch improves the proposal model by introducing two temporal perturbations, i.e., temporal feature shift and temporal feature flip, in the mean teacher framework. The self-supervised branch defines two pretext tasks, including masked feature reconstruction and clip-order prediction, to learn the relation of temporal clues. By this means, SSTAP can better explore unlabeled videos, and improve the discriminative abilities of learned action features. We extensively evaluate the proposed SSTAP on THUMOS14 and ActivityNet v1.3 datasets. The experimental results demonstrate that SSTAP significantly outperforms state-of-the-art semi-supervised methods and even matches fully-supervised methods. Code is available at https://github.com/wangxiang1230/SSTAP.
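The two temporal perturbations named in the abstract, temporal feature shift and temporal feature flip, are simple to sketch; the shift ratio below is an assumption:

```python
import torch

def temporal_feature_shift(x: torch.Tensor, ratio: float = 0.125) -> torch.Tensor:
    """Shift a fraction of feature channels by one step along time (a common
    reading of the 'temporal feature shift' perturbation; the ratio is an
    assumption). x: (B, C, T)."""
    out = x.clone()
    c = int(x.size(1) * ratio)
    out[:, :c, 1:] = x[:, :c, :-1]            # shift first block of channels forward
    out[:, c:2 * c, :-1] = x[:, c:2 * c, 1:]  # shift second block backward
    return out

def temporal_feature_flip(x: torch.Tensor) -> torch.Tensor:
    """Reverse the temporal order of the feature sequence. x: (B, C, T)."""
    return torch.flip(x, dims=[-1])
```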
Action proposal generation aims to locate the temporal boundaries of action segments in a video. In this paper, the LGCT network is proposed to address the problems of dense action segments, strong correlation between actions, and fuzzy boundaries in medical operation videos. LGCT is based on DETR, with two improvements: (1) to address the low proposal recall caused by weak contextual interactivity in the temporal dimension of video features, an LGBlock (Local and Global Block) is introduced at the encoding position of DETR to establish temporal context associations; (2) to address incomplete action proposals caused by blurred action boundaries, a completeness prediction head based on background context is proposed, which introduces the adjacent background information of proposals to predict completeness scores and stabilize the proposal generation pipeline. This paper conducts experimental exploration on the THUMOS14, ActivityNet1.3, and Medical-74 datasets. The entire model can be trained end-to-end, and the generated proposals do not require any post-processing operations. The test metric AR@500 for proposals reaches 62.29% and 75.31% on THUMOS14 and Medical-74, respectively, and AR@1 reaches 33.13% on ActivityNet1.3. Meanwhile, after introducing post-processing operations, AR@500 reaches 62.96% and 75.40% on THUMOS14 and Medical-74, and AR@1 reaches 33.21% on ActivityNet1.3.
Temporal action proposal generation aims at localizing the temporal segments containing human actions in a video. This work proposes a centerness-aware network (CAN), which is a novel one-stage approach intended to generate action proposals as keypoint triplets. A keypoint triplet contains two boundary points (starting and ending) and one center point. Specifically, we evaluate the probabilities of each temporal location in the video whether it is at the boundaries or the center region of ground truth action proposals. CAN optimizes the predicted boundary points interactively in a bidirectional adaptation form by exploiting the dependencies among them. Furthermore, to accurately locate the center points of action proposals with different time spans, temporal feature pyramids are utilized to incorporate multi-scale information explicitly. Using the generated three keypoints, CAN efficiently retrieves temporal proposals by grouping keypoints into triplets if they are geometrically aligned. Experiments show that CAN achieves the state-of-the-art performance on the public THUMOS-14 and ActivityNet-1.3 datasets. Moreover, further experiments demonstrate that by applying action classifiers on proposals generated by CAN, our method achieves the state-of-the-art performance in temporal action localization.
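Grouping (start, end, center) keypoints into proposals "if they are geometrically aligned" can be sketched as a simple alignment check; the tolerance threshold and scoring rule below are assumptions, not CAN's actual grouping procedure:

```python
def group_keypoint_triplets(starts, ends, centers, tol=0.15):
    """Illustrative sketch of grouping (start, end, center) keypoints into
    proposals: a triplet is kept if a detected center lies close enough to the
    midpoint of the start/end pair. `tol` (a fraction of the proposal length)
    and the product score are assumed values, not the paper's.
    starts, ends, centers: lists of (time, score) tuples."""
    proposals = []
    for s, s_score in starts:
        for e, e_score in ends:
            if e <= s:
                continue
            midpoint, length = (s + e) / 2.0, e - s
            for c, c_score in centers:
                if abs(c - midpoint) <= tol * length:   # geometric alignment check
                    proposals.append((s, e, s_score * e_score * c_score))
                    break
    return proposals
```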
Temporal action proposal generation is an essential and challenging task in video understanding, which aims to locate the temporal intervals that likely contain the actions of interest. Although great progress has been made, the problem is still far from being well solved. In particular, prevalent methods can handle well only the local dependencies (i.e., short-term dependencies) among adjacent frames but are generally powerless in dealing with the global dependencies (i.e., long-term dependencies) between distant frames. To tackle this issue, we propose CLGNet, a novel Collaborative Local-Global Learning Network for temporal action proposal. The majority of CLGNet is an integration of Temporal Convolution Network and Bidirectional Long Short-Term Memory, in which Temporal Convolution Network is responsible for local dependencies while Bidirectional Long Short-Term Memory takes charge of handling the global dependencies. Furthermore, an attention mechanism called the background suppression module is designed to guide our model to focus more on the actions. Extensive experiments on two benchmark datasets, THUMOS’14 and ActivityNet-1.3, show that the proposed method can outperform state-of-the-art methods, demonstrating the strong capability of modeling the actions with varying temporal durations.
Temporal action detection, a critical task in video activity understanding, is typically divided into two stages: proposal generation and classification. However, most existing methods overlook the importance of information transfer among proposals during classification, often treating each proposal in isolation, which hampers accurate label prediction. In this article, we propose a novel method for inferring semantic relationships both within and between action proposals, guiding the fusion of action proposal features accordingly. Building on this approach, we introduce the Proposal Semantic Relationship Graph Network (PSRGN), an end-to-end model that leverages intra-proposal semantic relationship graphs to extract cross-scale temporal context and an inter-proposal semantic relationship graph to incorporate complementary neighboring information, significantly improving proposal feature quality and overall detection performance. This is the first method to apply graph structure learning in temporal action detection, adaptively constructing the inter-proposal semantic graph. Extensive experiments on two datasets demonstrate the effectiveness of our approach, achieving state-of-the-art (SOTA) performance. Code and results are available at http://github.com/Riiick2011/PSRGN.
Temporal action proposal generation in an untrimmed video is very challenging, and comprehensive context exploration is critically important to generate accurate candidates of action instances. This paper proposes a Temporal-aware Attention Network (TAN) that localizes context-rich proposals by enhancing the temporal representations of boundaries and proposals. Firstly, we pinpoint that obtaining precise location information of action instances needs to consider long-distance temporal contexts. To this end, we propose a Global-Aware Attention (GAA) module for boundary-level interaction. Specifically, we introduce two novel gating mechanisms into the top-down interaction structure to incorporate multi-level semantics into video features effectively. Secondly, we design an efficient task-specific Adaptive Temporal Interaction (ATI) module to learn proposal associations. TAN enhances proposal-level contextual representations in a wide range by utilizing multi-scale interaction modules. Extensive experiments on the ActivityNet-1.3 and THUMOS-14 demonstrate the effectiveness of our proposed method, e.g., TAN achieves 73.43% in AR@1000 on THUMOS-14 and 69.01% in AUC on ActivityNet-1.3. Moreover, TAN significantly improves temporal action detection performance when equipped with existing action classification frameworks.
No abstract available
Weakly-supervised temporal action localization aims to detect temporal intervals of actions in arbitrarily long untrimmed videos with only video-level annotations. Owing to label sparsity, learning action consistency is intractable. In this paper, we assume that frames with similar representations in a given video should be considered as the same action. To this end, we develop a query-based contrastive learning paradigm to ensure action-semantic consistency. This mechanism encourages normalized embeddings with the same class to be pulled closer together, while embeddings from different classes are repelled apart. Besides, we design a two-branch framework, consisting of a class-aware branch and a class-agnostic branch, to learn salient features and fine-grained clues respectively. To further guarantee the action-semantic consistency of the two branches, unlike previous methods that handle each branch independently, we model the relationship between the two branches to avoid unreasonable predictions. Finally, the proposed model demonstrates superior performance over existing methods on the publicly available THUMOS-14 and ActivityNet-1.3 datasets. Substantial experiments and ablation studies also demonstrate the effectiveness of our model.
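The query-based contrastive mechanism pulls normalized embeddings of the same class together and repels those of different classes; a supervised-contrastive-style loss is a standard stand-in for this behaviour (not necessarily the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def class_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Supervised-contrastive-style sketch: snippets sharing a (pseudo) action
    label attract each other, different labels repel. A standard stand-in, not
    necessarily the paper's exact loss.
    embeddings: (N, D) snippet embeddings; labels: (N,) class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                               # (N, N) similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp_min(1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count        # mean over positives
    return loss.mean()
```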
Point-level weakly-supervised temporal action localization (P-WSTAL) aims to localize temporal extents of action instances and identify the corresponding categories with only a single point label for each action instance for training. Due to the sparse frame-level annotations, most existing models are in the localization-by-classification pipeline. However, there exist two major issues in this pipeline: large intra-action variation due to the task gap between classification and localization, and noisy classification learning caused by unreliable pseudo training samples. In this paper, we propose a novel framework, CRRC-Net, which introduces a co-supervised feature learning module and a probabilistic pseudo label mining module to simultaneously address the above two issues. Specifically, the co-supervised feature learning module is applied to exploit the complementary information in different modalities for learning more compact feature representations. Furthermore, the probabilistic pseudo label mining module utilizes the feature distances from action prototypes to estimate the likelihood of pseudo samples and rectify their corresponding labels for more reliable classification learning. Comprehensive experiments are conducted on different benchmarks, and the experimental results show that our method performs favorably against state-of-the-art approaches.
Weakly-supervised temporal action localization focuses on locating action intervals when merely video-level supervised signals are available. Conventional methods mostly rely on the attention framework, which generates a set of scores indicating the confidence that the video snippet belongs to the foreground, the background, and the context, respectively. However, such methods fail to consider the structural properties of snippet-level features when generating attention scores, and these structural properties are critical for capturing contextual information in temporal tasks. To this end, we propose a hierarchical attention generation mechanism with multi-scale fusion strategies to model such structural information. Besides, to resolve action-context confusion issues that are quite intractable in weakly-supervised action localization tasks, metric learning is further introduced into our framework to suppress context features from approaching action features, while encouraging them to be close to background features. Finally, our model is evaluated on THUMOS14 and ActivityNet1.3 benchmarks, and the results demonstrate that the proposed approach achieves desirable performance.
Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning (MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. The MIL-based methods are relatively well studied, with convincing performance achieved on classification but not on localization. Generally, they locate temporal regions via video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components: the first is an unsupervised change-points detection module that detects change-points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change-points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods, and even achieves comparable performance with fully-supervised methods.
Weakly supervised action localization is a challenging problem in video understanding and action recognition. Existing models usually formulate the training process as direct classification using video-level supervision. They tend to only locate the most discriminative parts of action instances and produce temporally incomplete detection results. A natural solution for this problem, the adversarial erasing strategy, is to remove such parts from training so that models can attend to complementary parts. Previous works do it in an offline and heuristic way. They adopt a multi-stage pipeline, where discriminative regions are determined and erased under the guidance of detection results from last stage. Such a pipeline can be both ineffective and inefficient, possibly hindering the overall performance. On the contrary, we combine adversarial erasing with dropout mechanism and propose a Temporal Dropout Module that learns where to remove in a data-driven and online manner. This plug-and-play module is trained without iterative stages, which not only simplifies the pipeline but also makes the regularization during training easier and more adaptive. Experiments show that the proposed method outperforms previous erasing-based methods by a large margin. More importantly, it achieves universal improvement when plugged into various direct classification methods and obtains state-of-the-art performance.
Weakly-supervised temporal action localization (WTAL) intends to detect action instances with only weak supervision, e.g., video-level labels. The current de facto pipeline locates action instances by thresholding and grouping continuous high-score regions on temporal class activation sequences. In this route, the capacity of the model to recognize the relationships between adjacent snippets is of vital importance, which determines the quality of the action boundaries. However, it is error-prone since the variations between adjacent snippets are typically subtle, and unfortunately this is overlooked in the literature. To tackle the issue, we propose a novel WTAL approach named Convex Combination Consistency between Neighbors (C3BN). C3BN consists of two key ingredients: a micro data augmentation strategy that increases the diversity in-between adjacent snippets by convex combination of adjacent snippets, and a macro-micro consistency regularization that enforces the model to be invariant to the transformations w.r.t. video semantics, snippet predictions, and snippet representations. Consequently, fine-grained patterns in-between adjacent snippets are enforced to be explored, thereby resulting in a more robust action boundary localization. Experimental results demonstrate the effectiveness of C3BN on top of various baselines for WTAL with video-level and point-level supervisions. Code is at https://github.com/Qinying-Liu/C3BN.
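The micro data augmentation in C3BN, a convex combination of adjacent snippets plus a consistency regularizer, can be sketched roughly as follows. This is an illustrative reading of the abstract rather than the released code; the mixing range, the KL form of the consistency term, and the toy linear classifier are assumptions.

```python
# Minimal sketch: mix adjacent snippets and enforce prediction consistency on the mixes.
import torch
import torch.nn.functional as F

def mix_adjacent_snippets(feats, alpha=0.5):
    """feats: (T, D) snippet features. Returns (T-1, D) convex combinations of temporal
    neighbours and the mixing coefficients used."""
    lam = torch.rand(feats.size(0) - 1, 1) * alpha      # random mixing ratios in [0, alpha)
    mixed = lam * feats[:-1] + (1.0 - lam) * feats[1:]
    return mixed, lam

def micro_consistency_loss(model, feats):
    """Predictions on mixed snippets should match the same convex combination of the
    predictions on the original snippets."""
    mixed, lam = mix_adjacent_snippets(feats)
    with torch.no_grad():
        p = model(feats).softmax(dim=-1)                 # (T, C) snippet-level predictions
    target = lam * p[:-1] + (1.0 - lam) * p[1:]          # interpolated pseudo targets
    pred = model(mixed).log_softmax(dim=-1)
    return F.kl_div(pred, target, reduction='batchmean')

# Toy usage with a hypothetical linear snippet classifier
model = torch.nn.Linear(32, 5)
snippets = torch.randn(10, 32)
print(micro_consistency_loss(model, snippets))
```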
Point-level weakly-supervised temporal action localization aims to accurately recognize and localize action segments in untrimmed videos, using only point-level annotations during training. Current methods primarily focus on mining sparse pseudo-labels and generating dense pseudo-labels. However, due to the sparsity of point-level labels and the impact of scene information on action representations, the reliability of dense pseudo-label methods still remains an issue. In this paper, we propose a point-level weakly-supervised temporal action localization method based on local representation enhancement and global temporal optimization. This method comprises two modules that enhance the representation capacity of action features and improve the reliability of class activation sequence classification, thereby enhancing the reliability of dense pseudo-labels and strengthening the model's capability for completeness learning. Specifically, we first generate representative features of actions using pseudo-label features, and calculate weights based on the feature similarity between these representative features and segment features to adjust the class activation sequence. Additionally, we maintain fixed-length queues for annotated segments and design an inter-video action contrastive learning framework. The experimental results demonstrate that our modules indeed enhance the model's capability for completeness learning, particularly achieving state-of-the-art results at high IoU thresholds.
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
Weakly supervised temporal action localization is a challenging vision task due to the absence of ground-truth temporal locations of actions in the training videos. With only video-level supervision during training, most existing methods rely on a Multiple Instance Learning (MIL) framework to predict the start and end frame of each action category in a video. However, the existing MIL-based approach has a major limitation of only capturing the most discriminative frames of an action, ignoring the full extent of an activity. Moreover, these methods cannot model background activity effectively, which plays an important role in localizing foreground activities. In this paper, we present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions to address these issues. Our temporal soft attention module, guided by an auxiliary background class in the classification module, models the background activity by introducing an "action-ness" score for each video snippet. Moreover, our temporal semi-soft and hard attention modules, calculating two attention scores for each video snippet, help to focus on the less discriminative frames of an action to capture the full action boundary. Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset, and by at least 1.3% mAP at IoU threshold 0.75 on the ActivityNet1.2 dataset.
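One plausible reading of the semi-soft and hard attention idea is sketched below: starting from the soft "action-ness" scores, the most discriminative snippets are suppressed so that the classifier must also rely on less discriminative parts of an action. The threshold and the exact suppression rules here are assumptions, not the HAM-Net definitions.

```python
# Rough, hypothetical sketch of soft / semi-soft / hard temporal attentions.
import torch

def hybrid_attentions(actionness, drop_thresh=0.8):
    """actionness: (T,) soft attention scores in [0, 1]."""
    soft = actionness
    discriminative = actionness > drop_thresh          # most confident snippets
    semi_soft = soft.masked_fill(discriminative, 0.0)  # keep soft values, drop the peaks
    hard = (~discriminative).float()                   # binary mask over remaining snippets
    return soft, semi_soft, hard

scores = torch.sigmoid(torch.randn(12))
for name, att in zip(("soft", "semi-soft", "hard"), hybrid_attentions(scores)):
    print(name, att)
```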
With the explosive growth of videos, weakly-supervised temporal action localization (WS-TAL) task has become a promising research direction in pattern analysis and machine learning. WS-TAL aims to detect and localize action instances with only video-level labels during training. Modern approaches have achieved impressive progress via powerful deep neural networks. However, robust and reliable WS-TAL remains challenging and underexplored due to considerable uncertainty caused by weak supervision, noisy evaluation environment, and unknown categories in the open world. To this end, we propose a new paradigm, named vectorized evidential learning (VEL), to explore local-to-global evidence collection for facilitating model performance. Specifically, a series of learnable meta-action units (MAUs) are automatically constructed, which serve as fundamental elements constituting diverse action categories. Since the same meta-action unit can manifest as distinct action components within different action categories, we leverage MAUs and category representations to dynamically and adaptively learn action components and action-component relations. After performing uncertainty estimation at both category-level and unit-level, the local evidence from action components is accumulated and optimized under the Subject Logic theory. Extensive experiments on the regular, noisy, and open-set settings of three popular benchmarks show that VEL consistently obtains more robust and reliable action localization performance than state-of-the-arts.
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video with video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion. In this work, we argue that the features extracted from pre-trained extractors, e.g., I3D, which are trained for trimmed-video action classification but are not specific to the WS-TAL task, lead to inevitable redundancy and sub-optimal performance. Therefore, feature re-calibration is needed to reduce this task-irrelevant redundancy. Here, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we mainly introduce two identical cross-modal consensus modules (CCM) that use a cross-modal attention mechanism to filter out task-irrelevant redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we further explore inter-modality consistency, where we treat the attention weights derived from each CCM as the pseudo targets of the attention weights derived from the other CCM to maintain consistency between the predictions of the two CCMs, forming a mutual learning manner. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, to verify our method, on which we achieve state-of-the-art results. The experimental results show that our proposed cross-modal consensus module can produce more representative features for temporal action localization.
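To make the cross-modal re-calibration idea concrete, here is a hedged sketch of a channel-gating module in that spirit: global context from the main modality and local features from the auxiliary modality jointly produce an attention map that filters the main modality. The layer layout and names are assumptions and differ from the actual CCM design.

```python
# Illustrative cross-modal gating module (assumed architecture details).
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv1d(2 * dim, dim, kernel_size=1),
                                  nn.ReLU(),
                                  nn.Conv1d(dim, dim, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, main, aux):
        """main, aux: (B, D, T) e.g. RGB / flow snippet features."""
        global_ctx = main.mean(dim=2, keepdim=True).expand_as(main)  # global info of main modality
        gate = self.fuse(torch.cat([global_ctx, aux], dim=1))        # (B, D, T) channel attention
        return main * gate                                           # filtered main-modality features

rgb, flow = torch.randn(2, 64, 100), torch.randn(2, 64, 100)
print(CrossModalGate(64)(rgb, flow).shape)
```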
No abstract available
No abstract available
No abstract available
Weakly-supervised temporal action localization (WTAL) is the problem of learning an action localization model with only video-level labels available. In recent years, many WTAL methods have been developed. However, hard-to-predict snippets near action boundaries are often not considered in these existing approaches, causing action incompleteness and over-completeness issues. To solve these issues, in this work, an end-to-end snippets relation and hard-snippets mask network (SRHN) is proposed. Specifically, a hard-snippets mask module is applied to mask the hard-to-predict snippets adaptively, so that the trained model focuses more on those snippets with low uncertainty. Then, a snippets relation module is designed to capture the relationship among snippets and make hard-to-predict snippets easier to predict by aggregating information over multiple temporal receptive fields. Finally, a snippet enhancement loss is developed to reduce, for hard-to-predict and other snippets, the probabilities of action classes that are not present in the video, while enlarging the probabilities of action classes that are present. Extensive experiments on the THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of the SRHN method.
In this study, we propose a single-stage model for video action detection and a real-world action detection dataset POWER collected from real power operation scenarios. While previous studies have made significant progress in overall classification and localization performance, they often struggle with the actions that have short duration, hindering the application of these approaches. To address this, we introduce the Cross-scale Selective Context Aggregation Network (CSCAN), which focuses on improving the detection of short actions. This network integrates three key components: 1) a cross-scale feature conduction structure combined with a tailored alignment mechanism; 2) a selective context aggregation module based on gating mechanism; and 3) an effective scale-invariant consistency training strategy to enable the model to learn scale-invariant action representation. We evaluated our method on the self-collected dataset POWER and on the most widely used action detection benchmarks THUMOS14 and ActivityNet v1.3. The extensive results show that our model outperforms other approaches, especially in detecting real-world short actions, demonstrating the effectiveness of our approach.
This paper proposes a novel multi-modal transformer network for detecting actions in untrimmed videos. To enrich the action features, our transformer network utilizes a new multi-modal attention mechanism that computes the correlations between different spatial and motion modalities combinations. Exploring such correlations for actions has not been attempted previously. To use the motion and spatial modality more effectively, we suggest an algorithm that corrects the motion distortion caused by camera movement. Such motion distortion, common in untrimmed videos, severely reduces the expressive power of motion features such as optical flow fields. Our proposed algorithm outperforms the state-of-the-art methods on two public benchmarks, THUMOS14 and ActivityNet. We also conducted comparative experiments on our new instructional activity dataset, including a large set of challenging classroom videos captured from elementary schools.
Temporal Action Detection (TAD) is a challenging task in video understanding. Current methods mainly use global features for boundary matching or predefine all possible proposals, while ignoring long context information and local action boundary features, resulting in reduced detection accuracy. To fill this gap, we propose a Dilation Location Network (DL-Net) model to generate more precise action boundaries by enhancing the boundary features of actions and aggregating long contextual information. Specifically, we design a boundary feature enhancement (BFE) block, which strengthens action boundary features and fuses similar features across channels via pooling and channel squeezing. Meanwhile, for action localization, we design multiple dilated convolutional structures to aggregate long contextual information of time points/intervals. Extensive experiments on ActivityNet-1.3 and THUMOS14 show that DL-Net is capable of enhancing action boundary features and aggregating long contextual information effectively.
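The general mechanism the DL-Net abstract refers to, aggregating long temporal context with dilated convolutions, can be sketched with a small stack of 1-D dilated layers; the layer count and dilation rates below are arbitrary assumptions rather than the paper's configuration.

```python
# Minimal sketch: stacked dilated 1-D convolutions with residual connections.
import torch
import torch.nn as nn

class DilatedContext(nn.Module):
    def __init__(self, dim, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations])

    def forward(self, x):
        """x: (B, D, T) snippet features; receptive field grows with each dilated layer."""
        for conv in self.layers:
            x = torch.relu(conv(x)) + x      # residual connection keeps local detail
        return x

feats = torch.randn(1, 32, 128)
print(DilatedContext(32)(feats).shape)       # (1, 32, 128): unchanged length, larger context
```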
Existing action detection approaches do not take the spatio-temporal structural relationships of action clips into account, which limits their applicability in real-world scenarios, even though such relationships can benefit detection if exploited. To this end, this paper formulates the action detection problem as a reinforcement learning process which is rewarded by observing both the clip sampling and classification results and adjusting the detection scheme accordingly. In particular, our framework consists of a heterogeneous graph convolutional network to represent the spatio-temporal features capturing their inherent relations, a policy network which determines the probabilities over a predefined action sampling space, and a classification network for action clip recognition. We accomplish joint network learning by considering the temporal intersection over union and the Euclidean distance between detected clips and the ground truth. Experiments on ActivityNet v1.3 and THUMOS14 demonstrate the effectiveness of our method.
Temporal action detection aims to correctly predict the categories and temporal intervals of actions in an untrimmed video by using only video-level labels, which is a basic but challenging task in video understanding. Inspired by Sparse R-CNN for object detection, we present a purely sparse method for temporal action detection. In our method, a fixed sparse set of N learnable temporal proposals (e.g., 50) is provided to a dynamic action interaction head to perform classification and localization. This sparse temporal action detection method completely avoids all efforts related to temporal candidate design and many-to-one label assignment. More importantly, final predictions are output directly without a non-maximum suppression post-processing step. Extensive experiments show that our method achieves state-of-the-art performance for both action proposal generation and localization on the THUMOS14 detection benchmark and competitive performance on the ActivityNet-1.3 challenge.
Current one-stage action detection methods, which simultaneously predict action boundaries and the corresponding class, do not estimate or use a measure of confidence in their boundary predictions, which can lead to inaccurate boundaries. We incorporate the estimation of boundary confidence into one-stage anchor-free detection, through an additional prediction head that predicts the refined boundaries with higher confidence. We obtain state-of-the-art performance on the challenging EPIC-KITCHENS-100 action detection benchmark as well as the standard THUMOS14 action detection benchmark, and achieve improvement on the ActivityNet-1.3 benchmark.
Proposal generation is a fundamental yet challenging task for two-stage temporal action detection pipelines. The task aims at predicting starting and ending boundaries of segments in realistic video sequences and action recognition methods cannot be directly applied to such videos due to their untrimmed nature. Most state-of-the-art models rely on temporal convolutional neural networks with pre-defined anchor segments. By eliminating anchors, we propose a lighter end-to-end trainable Anchor-Free Multiscale Transformer-based Generator (AMTG) model using local clues via video snippets. To improve effectiveness for temporal evaluation, we apply multiscale Transformer encoders to sequences with a bi-directional mask extension that simultaneously predicts boundary distances with uncertainties and various snippet-based local scores. Later, our model integrates local predictions to generate proposal candidates using the proposed scoring function. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of AMTG for the temporal proposal generation task.
Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer -- a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a light-weighted decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements upon prior works. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our code is available at http://github.com/happyharrycn/actionformer_release.
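The anchor-free, proposal-free decoding that ActionFormer-style models use can be illustrated with a small per-moment head: each time step receives class scores plus regressed distances to the action start and end, from which candidate segments are decoded directly. The sketch below shows only this generic decoding idea under assumed shapes; it is not the released ActionFormer code.

```python
# Hedged sketch of an anchor-free per-moment classification + boundary regression head.
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg = nn.Conv1d(dim, 2, kernel_size=3, padding=1)   # distances to start / end

    def forward(self, feats):
        """feats: (B, D, T). Returns per-moment class scores and decoded segments."""
        scores = self.cls(feats).sigmoid()                        # (B, C, T)
        dist = torch.relu(self.reg(feats))                        # non-negative offsets, (B, 2, T)
        t = torch.arange(feats.size(2), device=feats.device).float()
        starts, ends = t - dist[:, 0], t + dist[:, 1]             # (B, T) segment boundaries
        return scores, torch.stack([starts, ends], dim=1)

head = AnchorFreeHead(dim=64, num_classes=20)
scores, segments = head(torch.randn(2, 64, 256))
print(scores.shape, segments.shape)   # (2, 20, 256), (2, 2, 256)
```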
Detecting actions as they occur is essential for applications like video surveillance, autonomous driving, and human-robot interaction. Known as online action detection, this task requires classifying actions in streaming videos, handling background noise, and coping with incomplete actions. Transformer architectures are the current state-of-the-art, yet the potential of recent advancements in computer vision, particularly vision-language models (VLMs), remains largely untapped for this problem, partly due to high computational costs. In this paper, we introduce TOAD: A Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning. TOAD leverages CLIP (Contrastive Language-Image Pretraining) textual embeddings, enabling efficient use of VLMs without significant computational overhead. Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming existing methods, and sets new baselines for zero-shot and few-shot performance on the THUMOS14 and TVSeries datasets.
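At its core, text-driven zero-shot classification of this kind scores each frame feature against precomputed text embeddings of the class names (e.g., produced offline by a CLIP text encoder). A minimal sketch with assumed shapes and a made-up logit scale is given below; TOAD's online detection architecture adds much more on top of this.

```python
# Small sketch: per-frame action scores as scaled cosine similarity to text embeddings.
import torch
import torch.nn.functional as F

def text_driven_scores(frame_feats, class_text_embeds, scale=100.0):
    """frame_feats: (T, D) visual features; class_text_embeds: (C, D) text embeddings."""
    v = F.normalize(frame_feats, dim=-1)
    t = F.normalize(class_text_embeds, dim=-1)
    return (scale * v @ t.t()).softmax(dim=-1)     # (T, C) per-frame class probabilities

frames = torch.randn(30, 512)            # e.g. one feature per streamed frame
texts = torch.randn(21, 512)             # 20 action classes + a background prompt (toy values)
print(text_driven_scores(frames, texts).shape)
```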
We propose action-agnostic point-level (AAPL) supervision for temporal action detection to achieve accurate action instance detection with a lightly annotated dataset. In the proposed scheme, a small portion of video frames is sampled in an unsupervised manner and presented to human annotators, who then label the frames with action categories. Unlike point-level supervision, which requires annotators to search for every action instance in an untrimmed video, frames to annotate are selected without human intervention in AAPL supervision. We also propose a detection model and learning method to effectively utilize the AAPL labels. Extensive experiments on the variety of datasets (THUMOS'14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between the annotation cost and detection performance.
Due to the variability of video length and action duration, the temporal action detection task faces the problem of blurred action boundaries that are difficult to capture accurately. To alleviate this problem, this paper proposes a Frequency Attention Mechanism (FAM) that adaptively models the frequency dependencies between video signal channels, enabling the model to better understand frequency variations in the video and handle the complexity of different action durations, thus enhancing the sensitivity and discriminative power at action boundaries and providing strong action recognition capability even in long video sequences. Through comprehensive experimental validation on a series of representative benchmark datasets (e.g., THUMOS14 and ActivityNet1.3), our approach demonstrates significant performance improvement.
Most online action detection methods focus on solving a (K + 1) classification problem, where the additional category represents the ‘background’ class. However, training on the ‘background’ class and managing data imbalance are common challenges in online action detection. To address these issues, we propose a framework for online action detection by incorporating an additional pathway between the feature extractor and online action detection model. Specifically, we present one configuration that retains feature distinctions for fusion with the final decision from the Long Short-Term Transformer (LSTR), enhancing its performance in the (K + 1) classification. Experimental results show that the proposed method achieves an accuracy of 71.2% in mean Average Precision (mAP) on the Thumos14 dataset, outperforming the 69.5% achieved by the original LSTR method.
False predictions often hamper human action recognition in videos, reducing the reliability of detection models. This paper presents a novel approach that integrates the Video Vision Transformer (ViViT) and YOLOv8 to minimize false alarms in action detection. YOLOv8 detects human subjects within video segments, while ViViT reclassifies these segments to reduce false positives. We validate our method on two benchmark datasets: THUMOS14 and EPIC-Kitchen. Our experiments substantially reduce false positives, improving model performance without sacrificing accuracy. Specifically, our framework reduces false predictions by 45.8%. This approach enhances the precision of action detection models, offering a more robust and reliable solution for practical applications such as video surveillance and human activity analysis in untrimmed videos.
In this paper, we explore the problem of Online Action Detection (OAD), where the task is to detect ongoing actions from streaming videos without access to video frames in the future. Existing methods achieve good detection performance by capturing long-range temporal structures. However, a major challenge of this task is to detect actions at a specific time that arrive with insufficient observations. In this work, we utilize the additional future frames available at the training phase and propose a novel Knowledge Distillation (KD) framework for OAD, where a teacher network looks at more frames from the future and the student network distills the knowledge from the teacher for detecting ongoing actions from the observation up to the current frames. Usually, the conventional KD regards a high-level teacher network (i.e., the network after the last training iteration) to guide the student network throughout all training iterations, which may result in poor distillation due to the large knowledge gap between the high-level teacher and the student network at early training iterations. To remedy this, we propose a novel progressive knowledge distillation from different levels of teachers (PKD-DLT) for OAD, where in addition to a high-level teacher, we also generate several low- and middle-level teachers, and progressively transfer the knowledge (in the order of low- to high-level) to the student network throughout training iterations, for effective distillation. Evaluated on two challenging datasets THUMOS14 and TVSeries, we validate that our PKD-DLT is an effective teacher-student learning paradigm, which can be a plug-in to improve the performance of the existing OAD models and achieve a state-of-the-art.
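The distillation mechanics above can be sketched as a standard soft-label KL loss between student and teacher predictions, with the supervising teacher switched progressively from low- to high-level checkpoints as training proceeds. The temperature, schedule, and toy tensors below are illustrative assumptions rather than the PKD-DLT recipe.

```python
# Sketch: soft-label knowledge distillation with a progressive teacher schedule.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, tau=2.0):
    """Standard soft-label knowledge distillation between per-frame action logits."""
    p_t = (teacher_logits / tau).softmax(dim=-1)
    log_p_s = (student_logits / tau).log_softmax(dim=-1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * tau * tau

def pick_teacher(teachers, epoch, total_epochs):
    """Progressively move from low-level to high-level teachers over training."""
    idx = min(len(teachers) - 1, epoch * len(teachers) // total_epochs)
    return teachers[idx]

# Toy usage with three hypothetical teacher outputs of increasing quality
teachers = [torch.randn(8, 10) for _ in range(3)]
student = torch.randn(8, 10, requires_grad=True)
print(distill_loss(student, pick_teacher(teachers, epoch=5, total_epochs=30)))
```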
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD in short. Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video. This presents a generative modeling perspective, against previous discriminative learning manners. This capability is achieved by first diffusing the ground-truth proposals to random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in the Transformer decoder (e.g., DETR) by introducing a temporal location query design with faster convergence in training. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that our DiffTAD achieves top performance compared to previous art alternatives. The code is available at https://github.com/sauradip/DiffusionTAD.
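The forward (noising) half of such a diffusion formulation is easy to illustrate: ground-truth segments, e.g., represented as normalized (center, width) pairs, are corrupted with Gaussian noise under a schedule, and the network is trained to reverse the corruption. The sketch below shows only this forward step with an assumed cosine schedule; it is not DiffTAD's implementation.

```python
# Sketch of the forward (noising) step on ground-truth temporal segments.
import math
import torch

def noise_segments(gt_segments, step, num_steps=1000):
    """gt_segments: (N, 2) normalized (center, width) pairs; step: diffusion time step."""
    t = step / num_steps
    alpha_bar = math.cos(t * math.pi / 2) ** 2            # cosine schedule, in (0, 1]
    eps = torch.randn_like(gt_segments)
    noisy = math.sqrt(alpha_bar) * gt_segments + math.sqrt(1 - alpha_bar) * eps
    return noisy.clamp(0.0, 1.0), eps                      # corrupted proposals + noise target

gt = torch.tensor([[0.30, 0.10], [0.65, 0.20]])            # two ground-truth actions
print(noise_segments(gt, step=500))
```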
Online action detection plays a vital role in video action understanding and can be widely used in various video analysis applications. This task aims to detect actions at the current moment within long untrimmed video streams. However, accurately identifying action-background transitions that are ambiguous in terms of time during detection can be challenging due to the similarity between the action and background clips, adding to the difficulty in finding a suitable division between them. To address this issue, we propose a hard video clip mining method based on deep metric learning for online action detection named HCM. The HCM method first selects video clips that are hard to distinguish to determine the optimization objects. Then, a hard clip mining loss is adopted to push the features toward the centers of the categories to which they belong and away from others. Furthermore, we introduce an intra-class feature compaction loss to constrain the divergence of action features, ensuring the stability of their distribution. We evaluated the proposed method on two challenging online action detection datasets, THUMOS14 and TVSeries. The results show that HCM is effective and efficient in online action detection and action anticipation tasks.
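A center-based metric loss in the spirit of the hard clip mining objective above can be sketched as follows: clips whose features are not clearly closer to their own class center than to any other center are treated as hard and penalized. The margin value and the mining rule here are assumptions for illustration only.

```python
# Hedged sketch: hinge loss pulling hard clips toward their class center.
import torch
import torch.nn.functional as F

def hard_clip_center_loss(feats, labels, centers, margin=0.5):
    """feats: (N, D) clip features; labels: (N,); centers: (C, D) class centers."""
    d = torch.cdist(feats, centers)                        # (N, C) distances to every center
    pos = d.gather(1, labels.unsqueeze(1)).squeeze(1)      # distance to own class center
    d_other = d.scatter(1, labels.unsqueeze(1), float('inf'))
    neg = d_other.min(dim=1).values                        # distance to closest other center
    per_clip = F.relu(pos - neg + margin)                  # hinge: own center must be closer
    hard = per_clip > 0                                    # "hard" clips violate the margin
    return per_clip[hard].mean() if hard.any() else per_clip.sum()

feats = torch.randn(16, 32)
labels = torch.randint(0, 4, (16,))
centers = torch.randn(4, 32)
print(hard_clip_center_loss(feats, labels, centers))
```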
For the temporal action localization task on the ActivityNet-1.3 dataset, we aim to locate the temporal boundaries of each action and predict its class in untrimmed videos. We first apply VideoSwinTransformer as the feature extractor to extract different features. Then we apply a unified network following Faster-TAD to simultaneously obtain proposals and semantic labels. Last, we ensemble the results of different temporal action detection models, which complement each other. Faster-TAD simplifies the pipeline of TAD and achieves remarkable performance, obtaining results comparable to those of multi-step approaches.
Temporal Action Detection (TAD) is fundamental yet challenging for real-world video applications. Leveraging the unique benefits of transformers, various DETR-based approaches have been adopted in TAD. However, it has recently been identified that the attention collapse in self-attention causes the performance degradation of DETR for TAD. Building upon previous research, this paper newly addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from predictions, indicating a short-cut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which utilizes predictions to restore the collapse and align the cross- and self-attention with predictions. Specifically, we devise novel prediction-feedback objectives using guidance from the relations of the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.
Human body actions are an important form of non-verbal communication in social interactions. This paper specifically focuses on a subset of body actions known as micro-actions, which are subtle, low-intensity body movements with promising applications in human emotion analysis. In real-world scenarios, human micro-actions often temporally co-occur, with multiple micro-actions overlapping in time, such as concurrent head and hand movements. However, current research primarily focuses on recognizing individual micro-actions while overlooking their co-occurring nature. To address this gap, we propose a new task named Multi-label Micro-Action Detection (MMAD), which involves identifying all micro-actions in a given short video, determining their start and end times, and categorizing them. Accomplishing this requires a model capable of accurately capturing both long-term and short-term action relationships to detect multiple overlapping micro-actions. To facilitate the MMAD task, we introduce a new dataset named Multi-label Micro-Action-52 (MMA-52) and propose a baseline method equipped with a dual-path spatial-temporal adapter to address the challenges of subtle visual change in MMAD. We hope that MMA-52 can stimulate research on micro-action analysis in videos and prompt the development of spatio-temporal modeling in human-centric video understanding. The proposed MMA-52 dataset is available at: https://github.com/VUT-HFUT/Micro-Action.
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs due to proposal generation and/or per-proposal action instance evaluation and the resultant high computational cost. In this work, for the first time, we propose a proposal-free Temporal Action detection model with Global Segmentation mask (TAGS). Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length. The TAGS model differs significantly from the conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of action instances without proposals. Further, by modeling TAD holistically rather than locally at the individual proposal level, TAGS needs a much simpler model architecture with lower computational cost. Extensive experiments show that despite its simpler design, TAGS outperforms existing TAD methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is ~ 20x faster to train and ~1.6x more efficient for inference. Our PyTorch implementation of TAGS is available at https://github.com/sauradip/TAGS .
Temporal action detection (TAD) aims to localize the start and end frames of actions in untrimmed videos, which is a challenging task due to the similarity of adjacent frames and the ambiguity of action boundaries. Previous methods often generate coarse proposals first and then perform proposal-based refinement, which is coupled with prior action detectors and leads to proposal-oriented offsets. However, this paradigm increases the training difficulty of the TAD model and is heavily influenced by the quantity and quality of the proposals. To address the above issues, we decouple the refinement process from conventional TAD methods and propose a learnable, proposal-free refinement method for fine boundary localization, named RefineTAD. We first propose a multi-level refinement module to generate multi-scale boundary offsets, score offsets and boundary-aware probability at each time point based on the feature pyramid. Then, we propose an offset focusing strategy to progressively refine the predicted results of TAD models in a coarse-to-fine manner with our multi-scale offsets. We perform extensive experiments on three challenging datasets and demonstrate that our RefineTAD significantly improves the state-of-the-art TAD methods with minimal computational overhead.
The task of action detection aims at deducing both the action category and localization of the start and end moment for each action instance in a long, untrimmed video. While vision Transformers have driven the recent advances in video understanding, it is non-trivial to design an efficient architecture for action detection due to the prohibitively expensive self-attentions over a long sequence of video clips. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the fact that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP on THUMOS14, surpassing I3D+AFSD RGB model by over 10% and performing favorably against state-of-the-art AFSD that uses additional flow features with 31% fewer GFLOPs, which serves as an effective and efficient end-to-end Transformer-based framework for action detection.
No abstract available
Human activity recognition (HAR) based on skeleton data, which can be extracted from videos (e.g., with Kinect) or provided by a depth camera, is a time-series classification problem in which handling both spatial and temporal dependencies is crucial for good recognition. In online human activity recognition, identifying the beginning and end of an action is an important element and can be difficult in a continuous data stream. In this work, we present a 3D skeleton data encoding method to generate an image that preserves the spatial and temporal dependencies existing between the skeletal joints. To allow online action detection, we combine this encoding scheme with a sliding window over the continuous data stream. In this way, no start or stop timestamp is needed and recognition can be performed at any moment. A deep CNN is used to perform online action detection.
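The sliding-window mechanism on a continuous stream described above can be sketched in a few lines: frames are buffered, and a fixed-length window is emitted every `stride` frames for the CNN classifier, so detection can happen at any moment without start/stop timestamps. The window size and stride below are arbitrary assumptions.

```python
# Minimal sketch: sliding windows over a continuous stream of per-frame features.
from collections import deque

def sliding_windows(stream, window_size=32, stride=8):
    """stream: iterable of per-frame feature vectors. Yields fixed-length windows."""
    buf = deque(maxlen=window_size)
    for i, frame in enumerate(stream):
        buf.append(frame)
        if len(buf) == window_size and (i + 1 - window_size) % stride == 0:
            yield list(buf)                     # window ready for the classifier

frames = range(100)                             # stand-in for encoded skeleton frames
print(sum(1 for _ in sliding_windows(frames)))  # number of windows produced
```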
Temporal Action Detection (TAD) is a crucial but challenging task in video understanding. It aims at detecting both the type and the start and end frames of each action instance in a long, untrimmed video. Most current models adopt both RGB and optical-flow streams for the TAD task. Thus, original RGB frames must be converted manually into optical-flow frames with additional computation and time cost, which is an obstacle to achieving real-time processing. At present, many models adopt two-stage strategies, which slow down inference and require complicated tuning of proposal generation. By comparison, we propose a one-stage anchor-free temporal localization method with an RGB stream only, in which a novel Newtonian Mechanics-MLP architecture is established. It achieves accuracy comparable to all existing state-of-the-art models while surpassing their inference speed by a large margin: the typical inference speed reported in this paper is an astounding 4.44 videos per second on THUMOS14. In applications, because there is no need to compute optical flow, the inference speed will be even faster. This also shows that MLPs have great potential in downstream tasks such as TAD. The source code is available at https://github.com/BonedDeng/TadML
Temporal action detection aims to judge whether action instances exist in a long untrimmed video and to locate the start and end time of each action. Even though existing action detection methods have shown promising results in recent years with the widespread application of Convolutional Neural Networks (CNNs), it is still a challenging problem to accurately locate each action segment while ensuring real-time performance. In order to achieve a good tradeoff between detection efficiency and accuracy, we present a coarse-to-fine hierarchical temporal action detection method using a multi-scale sliding window mechanism. Since the complexity of the convolution operator is proportional to the number and size of the input video clips, the idea of our proposed method is to first determine candidate action proposals and then perform the detection task only on these candidates, with a view to reducing the overall complexity of the detection method. By making full use of the spatio-temporal information of video clips, a lightweight 3D-CNN classifier is first used to quickly determine whether a video clip is a candidate action proposal, avoiding the re-detection of a large number of non-action video clips by the heavyweight deep network. A heavyweight detector is designed to further improve the accuracy of action localization by considering both boundary regression loss and category loss in the target loss function. In addition, Non-Maximum Suppression (NMS) is performed to eliminate redundant detection results among overlapping proposals. The mean Average Precision (mAP) is 40.6%, 51.7% and 20.4% on the THUMOS14, ActivityNet and MPII Cooking datasets, respectively, when the temporal Intersection-over-Union (tIoU) threshold is set to 0.5. Experimental results show the superior performance of the proposed method on three challenging temporal action detection datasets while achieving real-time speed. At the same time, our method can generate proposals for unseen action classes with high recall.
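The final NMS step mentioned above is the standard temporal variant of non-maximum suppression; a plain-Python sketch of the generic algorithm (not the paper's exact code) is given here for reference.

```python
# Generic temporal NMS: suppress overlapping segments in favour of the highest score.
def temporal_nms(segments, scores, iou_thresh=0.5):
    """segments: list of (start, end); scores: list of floats. Returns kept indices."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        remaining = []
        for j in order:
            s1, e1 = segments[i]
            s2, e2 = segments[j]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            if union <= 0 or inter / union < iou_thresh:
                remaining.append(j)            # keep segments that do not overlap too much
        order = remaining
    return keep

segs = [(1.0, 5.0), (1.5, 5.5), (8.0, 12.0)]
print(temporal_nms(segs, [0.9, 0.8, 0.7]))     # -> [0, 2]
```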
The detection and recognition of distracted driving behaviors has emerged as a new vision task with the rapid development of computer vision, which is considered as a challenging temporal action localization (TAL) problem in computer vision. The primary goal of temporal localization is to determine the start and end time of actions in untrimmed videos. Currently, most state-of-the-art temporal localization methods adopt complex architectures, which are cumbersome and time-consuming. In this paper, we propose a robust and efficient two-stage framework for distracted behavior classification-localization based on the sliding window approach, which is suitable for untrimmed naturalistic driving videos. To address the issues of high similarity among different behaviors and interference from background classes, we propose a multi-view fusion and adaptive thresholding algorithm, which effectively reduces missing detections. To address the problem of fuzzy behavior boundary localization, we design a post-processing procedure that achieves fine localization from coarse localization through post connection and candidate behavior merging criteria. In the AICITY2024 Task3 TestA, our method performs well, achieving Average Intersection over Union(AIOU) of 0.6080 and ranking eighth in AICITY2024 Task3. Our code will be released in the near future.
This technical report analyzes an egocentric video action detection method we used in the 2021 EPIC-KITCHENS-100 competition hosted at the CVPR 2021 Workshop. The goal of our task is to locate the start and end times of each action in long untrimmed videos and predict the action category. We adopt a sliding window strategy to generate proposals, which can better adapt to short-duration actions. In addition, we show that classification and proposal generation conflict within the same network. Separating the two tasks boosts detection performance with high efficiency. By simply employing these strategies, we achieved 16.10% on the test set of the EPIC-KITCHENS-100 Action Detection challenge using a single model, surpassing the baseline method by 11.7% in terms of average mAP.
No abstract available
The aim of temporal action localization (TAL) is to determine the start and end frames of an action in a video. In recent years, TAL has attracted considerable attention because of its increasing applications in video understanding and retrieval. However, precisely estimating the duration of an action in the temporal dimension is still a challenging problem. In this paper, we propose an effective one‐stage TAL method based on a self‐defined motion data structure, called a dense joint motion matrix (DJMM), and a novel temporal detection strategy. Our method provides three main contributions. First, compared with mainstream motion images, DJMMs can preserve more pre‐processed motion features and provides more precise detail representations. Furthermore, DJMMs perfectly solve the temporal information loss problem caused by motion trajectory overlaps within a certain time period. Second, a spatial pyramid pooling (SPP) layer, which is widely used in the object detection and tracking fields, is innovatively incorporated into the proposed method for multi‐scale feature learning. Moreover, the SPP layer enables the backbone convolutional neural network (CNN) to receive DJMMs of any size in the temporal dimension. Third, a large‐scale‐first temporal detection strategy inspired by a well‐developed Chinese text segmentation algorithm is proposed to address long‐duration videos. Our method is evaluated on two benchmark data sets and one self‐collected data set: Florence‐3D, UTKinect‐Action3D and HanYue‐3D. The experimental results show that our method achieves competitive action recognition accuracy and high TAL precision, and its time efficiency and few‐shot learning capabilities enable it to be utilized for real‐time surveillance.
Surgical performance depends not only on surgeons’ technical skills, but also on team communication within and across the different professional groups present during the operation. Therefore, automatically identifying team communication in the OR is crucial for patient safety and advances in the development of computer-assisted surgical workflow analysis and intra-operative support systems. To take the first step, we propose a new task of detecting communication briefings involving all OR team members, i.e., the team Time-out and the StOP?-protocol, by localizing their start and end times in video recordings of surgical operations. We generate an OR dataset of real surgeries, called Team-OR, with more than one hundred hours of surgical videos captured by the multi-view camera system in the OR. The dataset contains temporal annotations of 33 Time-out and 22 StOP?-protocol activities in total. We then propose a novel group activity detection approach, where we encode both scene context and action features, and use an efficient neural network model to output the results. The experimental results on the Team-OR dataset show that our approach outperforms existing state-of-the-art temporal action detection approaches. It also demonstrates the lack of research on group activities in the OR, proving the significance of our dataset. We investigate the Team Time-Out and the StOP?-protocol in the OR, by presenting the first OR dataset with temporal annotations of group activities protocols, and introducing a novel group activity detection approach that outperforms existing approaches. Code is available at https://github.com/CAMMA-public/Team-OR.
Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenoiseLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence. Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64% mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves state-of-the-art performance on the TACoS and MAD datasets, but with much fewer predictions compared to other current methods.
Researchers in natural science need reliable methods for quantifying animal behavior. Recently, numerous computer vision methods emerged to automate the process. However, observing wild species at remote locations remains a challenging task due to difficult lighting conditions and constraints on power supply and data storage. Event cameras offer unique advantages for battery-dependent remote monitoring due to their low power consumption and high dynamic range capabilities. We use this novel sensor to quantify a behavior in Chinstrap penguins called ecstatic display. We formulate the problem as a temporal action detection task, determining the start and end times of the behavior. For this purpose, we recorded a colony of breeding penguins in Antarctica for several weeks and labeled event data on 16 nests. The developed method consists of a generator of candidate time intervals (proposals) and a classifier of the actions within them. The experiments show that the event cameras' natural response to motion is effective for continuous behavior monitoring and detection, reaching a mean average precision (mAP) of 58% (which increases to 63% in good weather conditions). The results also demonstrate the robustness against various lighting conditions contained in the challenging dataset. The low-power capabilities of the event camera allow it to record significantly longer than with a conventional camera. This work pioneers the use of event cameras for remote wildlife observation, opening new interdisciplinary opportunities. https://tub-rip.github.io/eventpenguins/
Weakly supervised video anomaly detection is an important problem in many real-world applications where during training there are some anomalous videos, in addition to nominal videos, without labelled frames to indicate when the anomaly happens. State-of-the-art methods in this domain typically focus on offline anomaly detection without any concern for real-time detection. Most of these methods rely on ad hoc feature aggregation techniques and the use of metric learning losses, which limit the ability of the models to detect anomalies in real-time. In line with the premise of deep neural networks, there also has been a growing interest in developing end-to-end approaches that can automatically learn effective features directly from the raw data. We propose the first real-time and end-to-end trained algorithm for weakly supervised video anomaly detection. Our training procedure builds upon recent action recognition literature and trains a large video model to learn visual features. This is in contrast to existing approaches which largely depend on pre-trained feature extractors. The proposed method significantly improves the anomaly detection speed and AUC performance compared to the existing methods. Specifically, on the UCF-Crime dataset, our method achieves 86.94% AUC with a decision period of 6.4 seconds while the competing methods achieve at most 85.92% AUC with a decision period of 273 seconds.
No abstract available
This report unifies the six core research directions of temporal action localization (TAL). The overall trends are as follows: technical architectures are moving from convolutional neural networks toward architectures capable of modeling long-range temporal dependencies, such as Transformers and Mamba (SSMs); supervision paradigms are evolving from full supervision, which relies heavily on frame-level annotations, toward weakly supervised, point-level supervised, and open-set/zero-shot learning to ease the annotation bottleneck; the algorithmic core still centers on fine-grained boundary modeling to improve localization accuracy; and the research scope has expanded from laboratory benchmark datasets to real-time online monitoring, multi-modal fusion, and diverse industrial application scenarios (e.g., power operations, healthcare, sports), with growing attention to computational efficiency and robustness in complex environments.