Temporal Action Localization
Weakly-Supervised Temporal Action Localization (WSTAL) and Feature Mining
This group of works focuses on localization when only video-level labels are available. Research emphases include multiple-instance learning (MIL), contrastive learning, background suppression, pseudo-label generation, and strengthening discriminability through attention mechanisms and feature modeling (e.g., diffusion networks, embedding modeling). This is currently the mainstream direction for reducing annotation cost; a minimal MIL sketch is given after the reference list below.
- Weakly-supervised action localization via embedding-modeling iterative optimization(Xiaoyu Zhang, Haichao Shi, Changsheng Li, Peng Li, Zekun Li, Peng-Shan Ren, 2021, Pattern Recognit.)
- Learning Proposal-Aware Re-Ranking for Weakly-Supervised Temporal Action Localization(Yufan Hu, Jie Fu, Mengyuan Chen, Junyu Gao, Jianfeng Dong, Bin Fan, Hongmin Liu, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Video Complicated-Information Extraction and Filtering Network for Weakly-Supervised Temporal Action Localization(Jiaxuan Li, Tiancheng Ma, Xiaohui Yang, Lijun Yang, Chen Zheng, 2025, IEEE Signal Processing Letters)
- Context Sensitive Network for weakly-supervised fine-grained temporal action localization(Cerui Dong, Qinying Liu, Zilei Wang, Y. Zhang, Fengjun Zhao, 2025, Neural networks : the official journal of the International Neural Network Society)
- Snippet-Inter Difference Attention Network for Weakly-Supervised Temporal Action Localization(Wei Zhou, Kang Lin, Weipeng Hu, Chao Xie, Tao Su, Haifeng Hu, Yap-Peng Tan, 2025, IEEE Transactions on Multimedia)
- Weakly supervised graph learning for action recognition in untrimmed video(Xiao Yao, Jia Zhang, Ruixuan Chen, Dan Zhang, Yifeng Zeng, 2022, The Visual Computer)
- Feature Weakening, Contextualization, and Discrimination for Weakly Supervised Temporal Action Localization(Md. Moniruzzaman, Zhaozheng Yin, 2024, IEEE Transactions on Multimedia)
- Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization(Songchun Zhang, Chunhui Zhao, 2023, IEEE Transactions on Circuits and Systems for Video Technology)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization(Bo He, Xitong Yang, Le Kang, Zhiyu Cheng, Xingfa Zhou, Abhinav Shrivastava, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- SAPS: Self-Attentive Pathway Search for weakly-supervised action localization with background-action augmentation(Xiaoyu Zhang, Yaru Zhang, Haichao Shi, Jing Dong, 2021, Comput. Vis. Image Underst.)
- Temporal Dropout for Weakly Supervised Action Localization(Chi Xie, Zikun Zhuang, Shengjie Zhao, Shuang Liang, 2022, ACM Transactions on Multimedia Computing, Communications and Applications)
- Exploring Denoised Cross-video Contrast for Weakly-supervised Temporal Action Localization(Jingjing Li, Tianyu Yang, Wei Ji, Jue Wang, Li Cheng, 2022, 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Entropy guided attention network for weakly-supervised action localization(Yi Cheng, Ying Sun, Hehe Fan, Tao Zhuo, J. Lim, Mohan Kankanhalli, 2022, Pattern Recognit.)
- Rethinking Pseudo-Label Guided Learning for Weakly Supervised Temporal Action Localization from the Perspective of Noise Correction(Quan Zhang, Yuxin Qi, Xi Tang, Rui Yuan, Xi Lin, Ke Zhang, Chun Yuan, 2025, No journal)
- Weakly-Supervised Action Localization by Hierarchically-structured Latent Attention Modeling(Guiqin Wang, Penghui Zhao, Cong Zhao, Shusen Yang, Jie Cheng, Luziwei Leng, Jianxing Liao, Qinghai Guo, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- A Collaborative Hierarchical Aggregation Network for Weakly Supervised Temporal Action Localization(Zan Gao, Xiaoyi Xu, Yibo Zhao, Chunjie Ma, Yanbing Xue, Riwei Wang, 2025, ACM Transactions on Multimedia Computing, Communications and Applications)
- Weakly Supervised Temporal Action Localization With Contrastive Learning-Based Action Salience Network(Jingtao Sun, Weipeng Shi, Shaoyang Hao, Jialin Wang, 2025, The European Journal on Artificial Intelligence)
- Action-to-Action Diffusion Network for Weakly Supervised Temporal Action Localization(Yuanbing Zou, Qingjie Zhao, Prodip Kumar Sarker, Le Yang, Binglu Wang, 2025, IEEE Transactions on Multimedia)
- Text-Video Knowledge Guided Prompting for Weakly Supervised Temporal Action Localization(Yuxiang Shao, Feifei Zhang, Changsheng Xu, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- Decoupled spatial-temporal predicting model for weakly supervised action localization(Guiqin Wang, Peng Zhao, Xiang Wang, Xin An, Qian Zhang, Shusen Yang, Qinghai Guo, 2026, Knowl. Based Syst.)
- Weakly-supervised Action Localization via Hierarchical Mining(Jialuo Feng, Fa-Ting Hong, Jiachen Du, Zhongang Qi, Ying Shan, Xiaohu Qie, Weihao Zheng, Jianping Wu, 2022, ArXiv)
- FCSC: Weakly-Supervised Temporal Action Localization via Feature Calibration-assisted Sequence Comparison(Ling Zhang, 2025, Journal of Computer Science and Frontier Technologies)
- GCLNet: Generalized Contrastive Learning for Weakly Supervised Temporal Action Localization(Jing Wang, Dehui Kong, Baocai Yin, 2025, IEEE Transactions on Big Data)
- Cross-Task Relation-Aware Consistency for Weakly Supervised Temporal Action Detection(Wenfei Yang, Huan Ren, Tianzhu Zhang, Zhe Zhang, Yongdong Zhang, Feng Wu, 2025, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- Forcing the Whole Video as Background: An Adversarial Learning Strategy for Weakly Temporal Action Localization(Ziqiang Li, Yongxin Ge, Jiaruo Yu, Zhongming Chen, 2022, Proceedings of the 30th ACM International Conference on Multimedia)
- A Hybrid Attention Mechanism for Weakly-Supervised Temporal Action Localization(Ashraful Islam, Chengjiang Long, R. Radke, 2021, ArXiv)
- Ensemble Prototype Network For Weakly Supervised Temporal Action Localization(Kewei Wu, Wenjie Luo, Zhao Xie, Dan Guo, Zhao Zhang, Richang Hong, 2024, IEEE Transactions on Neural Networks and Learning Systems)
- Convex Combination Consistency between Neighbors for Weakly-supervised Action Localization(Qinying Liu, Zilei Wang, Ruoxi Chen, Zhilin Li, 2022, ArXiv)
- GCRNet: Global Context Relation Network for Weakly-Supervised Temporal Action Localization: Identify the target actions in a long untrimmed video and find the corresponding action start point and end point.(Yiguan Liao, Changzhen Qiu, Zhiyong Zhang, Luping Wang, Liang Wang, 2021, Proceedings of the 2021 5th International Conference on Video and Image Processing)
- CoLA: Weakly-Supervised Temporal Action Localization with Snippet Contrastive Learning(Can Zhang, Meng Cao, Dongming Yang, Jie Chen, Yuexian Zou, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- TS-WTAL: A Two-Stage Framework for Weakly Supervised Temporal Action Localization(Shanzhen Lan, Shujun Wang, Wanting Wei, Yang Wang, 2025, 2025 International Conference on Culture-Oriented Science & Technology (CoST))
- Similar Modality Enhancement and Action Consistency Learning for Weakly Supervised Temporal Action Localization(Maodong Li, Chao Zheng, Jian Wang, Bing Li, 2025, No journal)
- Weakly-Supervised Action Localization by Hierarchical Attention Mechanism with Multi-Scale Fusion Strategies(Yu Wang, Sheng Zhao, 2024, 2024 IEEE International Conference on Multimedia and Expo (ICME))
- A Snippets Relation and Hard-Snippets Mask Network for Weakly-Supervised Temporal Action Localization(Yibo Zhao, Hua Zhang, Zan Gao, Weili Guan, Meng Wang, Shenyong Chen, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Bridge the Gap: From Weak to Full Supervision for Temporal Action Localization with PseudoFormer(Ziyi Liu, Yangcen Liu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Temporal and Semantic Correlation Network for Weakly-Supervised Temporal Action Localization(Kang Lin, Wei Zhou, Zhijie Zheng, Dihu Chen, Tao Su, 2025, ACM Transactions on Multimedia Computing, Communications and Applications)
- Action-Semantic Consistent Knowledge for Weakly-Supervised Action Localization(Yu Wang, Sheng Zhao, Shiwei Chen, 2024, IEEE Transactions on Multimedia)
- Video-Specific Query-Key Attention Modeling for Weakly-Supervised Temporal Action Localization(Xijun Wang, A. Katsaggelos, 2023, ArXiv)
- Weakly supervised temporal action localization via a multimodal feature map diffusion process(Yuanbing Zou, Qingjie Zhao, Shanshan Li, 2025, Eng. Appl. Artif. Intell.)
- Global context-aware attention model for weakly-supervised temporal action localization(Weina Fu, Wenxiang Zhang, Jing Long, Gautam Srivastava, Shuai Liu, 2025, Alexandria Engineering Journal)
- Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization(Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi, 2024, Proceedings of the 32nd ACM International Conference on Multimedia)
- Vectorized Evidential Learning for Weakly-Supervised Temporal Action Localization(Junyu Gao, Mengyuan Chen, Changsheng Xu, 2023, IEEE Transactions on Pattern Analysis and Machine Intelligence)
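Most of the WSTAL papers above share one core mechanism: a snippet-level class activation sequence (CAS) pooled into a video-level score so that only video-level labels are needed. The following is a minimal, illustrative sketch of that top-k MIL objective in PyTorch; the feature dimension, k ratio, class count, and the frozen I3D backbone mentioned in the comments are assumptions for illustration, not any specific paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MILHead(nn.Module):
    """Minimal multiple-instance-learning head for WSTAL (illustrative sketch).

    Snippet features -> class activation sequence (CAS); the video-level score
    per class is the mean of its top-k snippet scores, so only a video-level
    label is required for the classification loss.
    """
    def __init__(self, feat_dim: int, num_classes: int, k_ratio: float = 0.125):
        super().__init__()
        self.classifier = nn.Conv1d(feat_dim, num_classes, kernel_size=1)
        self.k_ratio = k_ratio

    def forward(self, feats: torch.Tensor):
        # feats: (B, T, D) snippet features from a frozen backbone (e.g., I3D)
        cas = self.classifier(feats.transpose(1, 2)).transpose(1, 2)  # (B, T, C)
        k = max(1, int(cas.shape[1] * self.k_ratio))
        topk_scores, _ = cas.topk(k, dim=1)            # (B, k, C)
        video_logits = topk_scores.mean(dim=1)         # (B, C)
        return cas, video_logits

# Usage: multi-label video-level BCE loss, no frame-level annotation needed.
head = MILHead(feat_dim=2048, num_classes=20)
feats = torch.randn(2, 750, 2048)                      # 2 videos, 750 snippets each
labels = torch.zeros(2, 20); labels[0, 3] = 1; labels[1, 7] = 1
cas, logits = head(feats)
loss = F.binary_cross_entropy_with_logits(logits, labels)
```

At inference, the CAS itself is thresholded and grouped into proposals, which is where the contrastive, background-suppression, and pseudo-label techniques listed above intervene.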
Architectural Evolution: Transformers, Mamba, and Diffusion Models
This group reflects the shift of TAL backbones from CNNs to more advanced architectures. It covers Transformer variants that use self-attention to capture long-range dependencies, efficient state-space models (Mamba) for very long videos, and generative proposal methods built on diffusion models, aiming to improve feature representation and end-to-end detection performance. A sketch of a Transformer-based snippet encoder follows the reference list below.
- TriDet: Temporal Action Detection with Relative Boundary Modeling(Ding Shi, Yujie Zhong, Qiong Cao, Lin Ma, Jia Li, Dacheng Tao, 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- DSPA: Dual-Spiral Pyramid Network with Multi-scale Attention for Temporal Action Localization(Xuhong Li, Shuai Zhang, Haiyu Liu, Hexiong Yang, Keyan Ren, 2025, 2025 International Joint Conference on Neural Networks (IJCNN))
- Transformer-Based Temporal Feature Pyramid Network for Temporal Action Proposal Generation(Yan Zhang, Tian Xiao, Lu Zhi, Feibi Lv, Jiajia Zhu, Liang Liu, Zhaoning Wang, Zixiang Di, Lexi Xu, Bei Li, 2025, 2025 IEEE International Conference on High Performance Computing and Communications (HPCC))
- DroFormer: temporal action detection with drop mechanism of attention(Xuejiao Lee, Chaoqun Hong, Xuebai Zhang, Yongfeng Chen, 2025, International Journal of Machine Learning and Cybernetics)
- TBT-Former: Learning Temporal Boundary Distributions for Action Localization(Thisara Rathnayaka, Uthayasanker Thayasivam, 2025, ArXiv)
- ReAct: Temporal Action Detection with Relational Queries(Ding Shi, Yujie Zhong, Qiong Cao, Jing Zhang, Lin Ma, Jia Li, Dacheng Tao, 2022, ArXiv)
- End-to-End Temporal Action Detection With Transformer(Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Shiwei Zhang, S. Bai, X. Bai, 2021, IEEE Transactions on Image Processing)
- Hierarchical Multi-Stage Transformer Architecture for Context-Aware Temporal Action Localization(Hayat Ullah, Arslan Munir, Oliver A. Nina, 2025, ArXiv)
- TALLFormer: Temporal Action Localization with Long-memory Transformer(Feng Cheng, Gedas Bertasius, 2022, ArXiv)
- Global-aware Pyramid Network with Boundary Adjustment for Anchor-free Temporal Action Detection(Zhuyuan Liang, Pengjun Zhai, Dulei Zheng, Yu Fang, 2022, Proceedings of the 2022 3rd International Conference on Control, Robotics and Intelligent System)
- Feature matters: Revisiting channel attention for Temporal Action Detection(Guo Chen, Yin-Dong Zheng, Wei Zhu, Jiahao Wang, Tong Lu, 2025, Pattern Recognit.)
- Prediction-Feedback DETR for Temporal Action Detection(Jihwan Kim, Miso Lee, Cheol-Ho Cho, Jihyun Lee, Jae-pil Heo, 2024, No journal)
- Faster-TAD: Towards Temporal Action Detection with Proposal Generation and Classification in a Unified Network(Shimin Chen, Chen Chen, Wei Li, Xunqiang Tao, Yan Guo, 2022, ArXiv)
- Dual DETRs for Multi-Label Temporal Action Detection(Yuhan Zhu, Guozhen Zhang, Jing Tan, Gangshan Wu, Limin Wang, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- BFSTAL: Bidirectional Feature Splitting With Cross-Layer Fusion for Temporal Action Localization(Jinglin Xu, Yaqi Zhang, Wenhao Zhou, Hongmin Liu, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion(Sauradip Nag, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- IACFormer: a transformer framework with instantaneous average convolution for temporal action detection(Haiping Zhang, Dongyang Xu, Haixiang Lin, Dongjing Wang, Dongjin Yu, L. Guan, Wanjun Zhang, 2025, Applied Intelligence)
- ActionFormer: Localizing Moments of Actions with Transformers(Chen-Lin Zhang, Jianxin Wu, Yin Li, 2022, ArXiv)
- An Adaptive Dual Selective Transformer for Temporal Action Localization(Qiang Li, Guang Zu, Hui Xu, Jun Kong, Yanni Zhang, Jianzhong Wang, 2024, IEEE Transactions on Multimedia)
- Temporal Action Proposal Generation with Transformers(Lining Wang, Haosen Yang, Wenhao Wu, H. Yao, Hujie Huang, 2021, ArXiv)
- Multi-scale interaction transformer for temporal action proposal generation(Jiahui Shang, Ping Wei, Huan Li, Nanning Zheng, 2022, Image Vis. Comput.)
- LGAFormer: transformer with local and global attention for action detection(Haiping Zhang, Fuxing Zhou, Dongjing Wang, Xinhao Zhang, Dongjin Yu, L. Guan, 2024, The Journal of Supercomputing)
- Relaxed Transformer Decoders for Direct Action Proposal Generation(Jing Tan, Jiaqi Tang, Limin Wang, Gangshan Wu, 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))
- KeyMamba: keyframe-enhanced state space model for efficient temporal action detection(Zikai Chen, Dan Wei, Peixing Li, Xiaolan Wang, 2025, Journal of Electronic Imaging)
- MambaTAD: When State-Space Models Meet Long-Range Temporal Action Detection(Hui Lu, Yi Yu, Shijian Lu, Deepu Rajan, Boon Poh Ng, A. Kot, Xudong Jiang, 2025, ArXiv)
- Local Temporal Mamba for Temporal Action Detection(Di Cui, Qi Zhang, 2025, 2025 6th International Conference on Computers and Artificial Intelligence Technology (CAIT))
- Transformer or Mamba for Temporal Action Localization? Insights from a Comprehensive Experimental Comparison Study(Zejian Zhang, Cristina Palmero, Sergio Escalera, 2025, No journal)
- Modeling long-term video semantic distribution for temporal action proposal generation(Tingting Han, Sicheng Zhao, Xiaoshuai Sun, Jun Yu, 2021, Neurocomputing)
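To make the encoder shift concrete, here is a rough sketch of the pattern used by ActionFormer-style detectors listed above: self-attention over snippet features plus strided pooling to build a temporal feature pyramid. The layer count, pooling scheme, and dimensions are assumptions for illustration only, not any paper's exact configuration.

```python
import torch
import torch.nn as nn

class SnippetTransformerEncoder(nn.Module):
    """Sketch of a Transformer temporal encoder with a feature pyramid.

    Self-attention captures long-range snippet dependencies; strided max-pooling
    between layers halves the temporal resolution, giving multi-scale features
    that a detection head can decode into per-snippet classes and boundary offsets.
    """
    def __init__(self, dim: int = 512, heads: int = 8, levels: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                       dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(levels))
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor):
        # x: (B, T, D) projected snippet features
        pyramid = []
        for block in self.blocks:
            x = block(x)                                       # global self-attention
            pyramid.append(x)                                  # keep this resolution
            x = self.pool(x.transpose(1, 2)).transpose(1, 2)   # halve T for next level
        return pyramid                                         # [(B, T, D), (B, T/2, D), ...]

feats = torch.randn(2, 256, 512)
levels = SnippetTransformerEncoder()(feats)
print([f.shape for f in levels])
```

Mamba-based variants replace the attention block with a state-space layer to avoid quadratic cost on very long sequences, while keeping the same pyramid-and-head structure.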
Action Boundary Refinement, Proposal Generation, and Confidence Optimization
These works target a core difficulty of action localization: boundary ambiguity. Techniques such as boundary denoising, phase consistency, high-resolution modeling, uncertainty estimation, and confidence calibration are used to improve proposal quality and localization accuracy (IoU). A sketch of tIoU-based proposal rescoring follows the reference list below.
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization(Chuming Lin, C. Xu, Donghao Luo, Yabiao Wang, Ying Tai, Chengjie Wang, Jilin Li, Feiyue Huang, Yanwei Fu, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Action Category and Phase Consistency Regularization for High-Quality Temporal Action Proposal Generation(Yushu Liu, Weigang Zhang, Guorong Li, Qingming Huang, 2021, 2021 IEEE International Conference on Multimedia and Expo (ICME))
- MBGNet:Multi-branch boundary generation network with temporal context aggregation for temporal action detection(Xiaoying Pan, Nijuan Zhang, Hewei Xie, Shoukun Li, Tong Feng, 2024, Applied Intelligence)
- Capsule Boundary Network With 3D Convolutional Dynamic Routing for Temporal Action Detection(Chen Yaosen, Bing Guo, Yan Shen, Wei Wang, Weichen Lu, Xinhua Suo, 2022, IEEE Transactions on Circuits and Systems for Video Technology)
- SARNet: Self-attention Assisted Ranking Network for Temporal Action Proposal Generation(Jiahao Yu, Hong Jiang, 2021, 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC))
- FDDet: Frequency-Decoupling for Boundary Refinement in Temporal Action Detection(Xinnan Zhu, Yichen Zhu, Tixin Chen, Wentao Wu, Yuanjie Dang, 2025, ArXiv)
- Boundary-Denoising for Video Activity Localization(Mengmeng Xu, Mattia Soldan, Jialin Gao, Shuming Liu, Juan-Manuel P'erez-R'ua, Bernard Ghanem, 2023, ArXiv)
- Phase-Sensitive Model for Temporal Action Proposal Generation(Shijie Sun, Qingsong Zhao, Ziliang Ren, Lei Wang, Jun Cheng, 2021, 2020 IEEE International Conference on E-health Networking, Application & Services (HEALTHCOM))
- Temporal Action Proposal Generation With Action Frequency Adaptive Network(Yepeng Tang, Weining Wang, Chunjie Zhang, J. Liu, Yao Zhao, 2024, IEEE Transactions on Multimedia)
- Centerness-Aware Network for Temporal Action Proposal(Yuan Liu, Jingyuan Chen, Xinpeng Chen, Bing Deng, Jianqiang Huang, Xiansheng Hua, 2022, IEEE Transactions on Circuits and Systems for Video Technology)
- Boundary graph convolutional network for temporal action detection(Chen Yaosen, Bing Guo, Yan Shen, W. Wang, Weichen Lu, Xinhua Suo, 2021, Image Vis. Comput.)
- A Malleable Boundary Network for temporal action detection(Tian Wang, Boyao Hou, Zexian Li, Z. Li, Lei Huang, Baochang Zhang, H. Snoussi, 2022, Comput. Electr. Eng.)
- Boundary-Aware Proposal Generation Method for Temporal Action Localization(Hao Zhang, Chunyan Feng, Jiahui Yang, Zheng Li, Caili Guo, 2023, 2023 IEEE International Conference on Visual Communications and Image Processing (VCIP))
- RefineTAD: Learning Proposal-free Refinement for Temporal Action Detection(Yue Feng, Zhengye Zhang, Rong Quan, Limin Wang, Jie Qin, 2023, Proceedings of the 31st ACM International Conference on Multimedia)
- Refining Action Boundaries for One-stage Detection(Hanyuan Wang, M. Mirmehdi, D. Damen, Toby Perrett, 2022, 2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS))
- Boundary Adjusted Network Based on Cosine Similarity for Temporal Action Proposal Generation(Jingye Zheng, Dihu Chen, Haifeng Hu, 2021, Neural Processing Letters)
- Internal Location Assistance for Temporal Action Proposal Generation(Songsong Feng, Shengye Yan, 2024, ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
- Attention-guided Boundary Refinement on Anchor-free Temporal Action Detection(Henglin Shi, Haoyu Chen, Guoying Zhao, 2023, No journal)
- BACNet: Boundary-Anchor Complementary Network for Temporal Action Detection(Zixuan Zhao, Dongqi Wang, Xu Zhao, 2022, 2022 IEEE International Conference on Multimedia and Expo (ICME))
- Advancing Temporal Action Localization with a Boundary Awareness Network(Jialiang Gu, Yang Yi, Min Wang, 2024, Electronics)
- Multi-Level Content-Aware Boundary Detection for Temporal Action Proposal Generation(Taiyi Su, Hanli Wang, Lei Wang, 2023, IEEE Transactions on Image Processing)
- Boundary-Recovering Network for Temporal Action Detection(Jihwan Kim, Jaehyun Choi, Yerim Jeon, Jae-pil Heo, 2024, ArXiv)
- BRTAL: Boundary Refinement Temporal Action Localization via Offset-Driven Diffusion Models(Hongmin Liu, Xueli Li, Bin Fan, Jinglin Xu, 2025, IEEE Transactions on Circuits and Systems for Video Technology)
- Context-BMN for Temporal Action Proposal Generation(Baoqing Tang, Shengye Yan, Yihua Ni, Yongjia Yang, Kang Pan, 2021, No journal)
- Temporal Action Proposal Generation with Background Constraint(Haosen Yang, Wenhao Wu, Lining Wang, Sheng Jin, Boyang Xia, H. Yao, Hujie Huang, 2021, No journal)
- PUNet: Temporal Action Proposal Generation With Positive Unlabeled Learning Using Key Frame Annotations(Noor ul Sehr Zia, O. Kayhan, J. V. Gemert, 2021, 2021 IEEE International Conference on Image Processing (ICIP))
- Anchor-Free Action Proposal Network with Uncertainty Estimation(Selen Pehlivan, J. Laaksonen, 2023, 2023 IEEE International Conference on Multimedia and Expo (ICME))
- Class-wise boundary regression by uncertainty in temporal action detection(Y. Chen, Mengjuan Chen, Qingyi Gu, 2022, IET Image Process.)
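Many of the confidence-oriented methods above ultimately feed into temporal IoU (tIoU) computation and score-aware suppression of overlapping proposals. The sketch below shows tIoU and a Gaussian Soft-NMS pass, a common post-processing step in this literature; the decay form, sigma, and thresholds are illustrative choices, not taken from any cited paper.

```python
import numpy as np

def temporal_iou(seg, segs):
    """tIoU between one segment [start, end] and an array of segments (N, 2)."""
    inter = np.clip(np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]), 0, None)
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-6)

def soft_nms(proposals, scores, sigma=0.5, thresh=1e-3):
    """Gaussian Soft-NMS: decay (rather than drop) scores of overlapping proposals."""
    proposals, scores = proposals.copy(), scores.copy()
    keep = []
    while scores.max() > thresh:
        i = scores.argmax()
        keep.append((proposals[i].copy(), float(scores[i])))
        ious = temporal_iou(proposals[i], proposals)
        scores = scores * np.exp(-(ious ** 2) / sigma)   # soft decay of overlapping scores
        scores[i] = 0.0                                  # the picked proposal is consumed
    return keep

props = np.array([[10.0, 25.0], [11.0, 26.0], [40.0, 55.0]])
confs = np.array([0.9, 0.8, 0.7])
print(soft_nms(props, confs))
```

Boundary-refinement methods differ mainly in how the segments fed into this step are produced (regressed offsets, denoised boundaries, uncertainty-weighted scores), but the evaluation target is the same tIoU quantity.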
Open-Set Localization, Zero-Shot Learning, and Multimodal Fusion
These works study how to handle action categories unseen during training (open-vocabulary / zero-shot), typically by transferring knowledge from vision-language foundation models such as CLIP. The group also covers methods that fuse audio, textual descriptions, and image segmentation to enrich action semantics. A CLIP-based zero-shot classification sketch follows the reference list below.
- Training-Free Zero-Shot Temporal Action Detection with Vision-Language Models(Chaolei Han, Hongsong Wang, Jidong Kuang, Lei Zhang, Jie Gui, 2025, ArXiv)
- Concept-Guided Open-Vocabulary Temporal Action Detection(Song Wang, Rui Han, Wei Feng, 2025, Journal of Computer Science and Technology)
- Hierarchical Global–Local Fusion for One-stage Open-vocabulary Temporal Action Detection(Trung Thanh Nguyen, Yasutomo Kawanishi, Takahiro Komamizu, Ichiro Ide, 2025, ACM Transactions on Multimedia Computing, Communications and Applications)
- Test-Time Zero-Shot Temporal Action Localization(Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Zero-Shot Temporal Action Detection via Vision-Language Prompting(Sauradip Nag, Xiatian Zhu, Yi-Zhe Song, Tao Xiang, 2022, ArXiv)
- Toward Causal and Evidential Open-Set Temporal Action Detection(Zhuoyao Wang, Ruiwei Zhao, Rui Feng, Cheng Jin, 2025, IEEE Access)
- MGCA-Net: Multi-Grained Category-Aware Network for Open-Vocabulary Temporal Action Localization(Zhenying Fang, Richang Hong, 2025, ArXiv)
- Zero-Shot Temporal Action Detection by Learning Multimodal Prompts and Text-Enhanced Actionness(Asif Raza, Bang Yang, Yuexian Zou, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection(Yingsen Zeng, Yujie Zhong, Chengjian Feng, Lin Ma, 2024, ArXiv)
- EAV-Mamba: Efficient Audio-Visual Representation Learning for Weakly-Supervised Temporal Action Localization(Quan Zhang, Jinwei Fang, Yuxin Qi, Mingyang Wan, Guojun Ma, Ke Zhang, Chun Yuan, 2025, 2025 IEEE International Conference on Multimedia and Expo (ICME))
- Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization(Fa-Ting Hong, Jialuo Feng, Dan Xu, Ying Shan, Weishi Zheng, 2021, Proceedings of the 29th ACM International Conference on Multimedia)
- CG-SMFNet: Consensus-Guided Selective Multimodal Fusion for Weakly Supervised Temporal Action Localization(Peng Liu, Zitai Jiang, 2025, 2025 IEEE International Workshop on Multimedia Signal Processing (MMSP))
- CLIP-AE: CLIP-assisted Cross-view Audio-Visual Enhancement for Unsupervised Temporal Action Localization(Ruiqi Xia, Dan Jiang, Quan Zhang, Ke Zhang, Chun Yuan, 2025, ArXiv)
- A Multi-Modal Transformer Network for Action Detection(Matthew Korban, S. Acton, Peter A. Youngs, 2023, Pattern Recognit.)
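The zero-shot pipelines above generally score each snippet against text embeddings of the category names. The sketch below illustrates that step with the openai `clip` package (assumed installed via `pip install git+https://github.com/openai/CLIP.git`); the action vocabulary, prompt template, and dummy frame tensors are hypothetical, and real systems additionally gate these similarities with a class-agnostic actionness score before grouping snippets into segments.

```python
import torch
import clip  # https://github.com/openai/CLIP, assumed available

device = "cpu"  # keep float32 weights so the dummy tensors below work as-is
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical unseen-action vocabulary with a simple prompt template.
actions = ["high jump", "playing violin", "washing dishes"]
text = clip.tokenize([f"a video of a person {a}" for a in actions]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(text)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    # snippet_frames: (T, 3, 224, 224) one representative frame per snippet (dummy here)
    snippet_frames = torch.randn(8, 3, 224, 224).to(device)
    vis_emb = model.encode_image(snippet_frames)
    vis_emb = vis_emb / vis_emb.norm(dim=-1, keepdim=True)

    sim = (100.0 * vis_emb @ text_emb.T).softmax(dim=-1)   # (T, num_actions)

print(sim.argmax(dim=-1))  # most likely open-vocabulary label per snippet
```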
Low-Cost Annotation: Semi-Supervised and Point-Level Supervision
To address the high cost of full supervision, these works explore point-level supervision, semi-supervised learning, and data-programming frameworks. Through self-supervised pre-training and consistency constraints, they maintain competitive localization performance with very few annotations. A point-to-segment pseudo-labeling sketch follows the reference list below.
- Action-Agnostic Point-Level Supervision for Temporal Action Detection(Shuhei M. Yoshida, Takashi Shibata, Makoto Terao, Takayuki Okatani, Masashi Sugiyama, 2024, ArXiv)
- Boosting Point-Supervised Temporal Action Localization through Integrating Query Reformation and Optimal Transport(Mengnan Liu, Le Wang, Sanpin Zhou, Kun Xia, Xiaolong Sun, Gang Hua, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- SQL-Net: Semantic Query Learning for Point-Supervised Temporal Action Localization(Yu Wang, Sheng Zhao, Shiwei Chen, 2025, IEEE Transactions on Multimedia)
- Learning Reliable Dense Pseudo-Labels for Point-Level Weakly-Supervised Action Localization(Yuanjie Dang, G. Zheng, Peng Chen, Nan Gao, Ruohong Huan, Dongdong Zhao, Ronghua Liang, 2024, Neural Processing Letters)
- Semi-Supervised Temporal Action Proposal Generation via Exploiting 2-D Proposal Map(Weining Wang, Tianwei Lin, Dongliang He, Fu Li, Shilei Wen, Liang Wang, Jing Liu, 2021, IEEE Transactions on Multimedia)
- Pseudo label refining for semi-supervised temporal action localization(Lingwen Meng, Guobang Ban, Guanghui Xi, Siqi Guo, 2025, PLOS ONE)
- Self-Supervised Learning for Semi-Supervised Temporal Action Proposal(Xiang Wang, Shiwei Zhang, Zhiwu Qing, Yuanjie Shao, Changxin Gao, N. Sang, 2021, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- ProTAL: A Drag-and-Link Video Programming Framework for Temporal Action Localization(Yuchen He, Jianbing Lv, Liqi Cheng, Lingyu Meng, Dazhen Deng, Yingcai Wu, 2025, Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems)
- SSPT‐Tr: Self‐Supervised Pre‐Training Transformer Based on Triplet for Temporal Action Detection(Qiongmin Zhang, Zeyuan Deng, Bingyi Ran, Shuqiu Tan, Xin Feng, 2025, IEEJ Transactions on Electrical and Electronic Engineering)
- Temporal action proposal generation with self-supervised pre-training transformer(Pan Pan, Xinyu Feng, Li Geng, 2023, No journal)
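A recurring ingredient in the point-supervised methods above is turning a single annotated frame per action into a dense pseudo segment. The function below shows one common heuristic for that expansion (grow the segment while the class score stays above a fraction of the score at the annotated point); the threshold ratio is an assumption, and the cited papers use more elaborate, learned variants of this idea.

```python
import numpy as np

def expand_point_to_segment(cas, point_idx, cls, thresh_ratio=0.5):
    """Grow a pseudo action segment around a single annotated snippet.

    cas:        (T, C) snippet-level class activation scores
    point_idx:  index of the annotated snippet inside the action
    cls:        annotated action class
    The segment extends left/right while the class score stays above a
    fraction of the score at the annotated point.
    """
    scores = cas[:, cls]
    thresh = thresh_ratio * scores[point_idx]
    start = point_idx
    while start > 0 and scores[start - 1] >= thresh:
        start -= 1
    end = point_idx
    while end < len(scores) - 1 and scores[end + 1] >= thresh:
        end += 1
    return start, end  # inclusive snippet indices of the pseudo segment

cas = np.concatenate([np.full(10, 0.1), np.full(15, 0.8), np.full(10, 0.1)])[:, None]
print(expand_point_to_segment(cas, point_idx=17, cls=0))  # -> (10, 24)
```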
Online Detection, Efficient Computation, and Industrial Robustness
These works target practical deployment: online action detection (OAD) for streaming video, model compression, efficient end-to-end adaptation (e.g., LoRA / LoSA fine-tuning), and robustness in vertical domains such as power grids, healthcare, and sports as well as under noisy data. A minimal streaming-detection sketch follows the reference list below.
- Online Action Detection Incorporating an Additional Action Classifier(Min-Hang Hsu, Chen-Chien Hsu, Yin-Tien Wang, Shao-Kang Huang, Yi-Hsing Chien, 2024, Electronics)
- HCM: Online Action Detection With Hard Video Clip Mining(Siyu Liu, Jian Cheng, Ziying Xia, Zhilong Xi, Qin Hou, Zhicheng Dong, 2024, IEEE Transactions on Multimedia)
- Streamer temporal action detection in live video by co-attention boundary matching(Chenhao Li, Chenghai He, Hui Zhang, Jiacheng Yao, J. Zhang, L. Zhuo, 2022, International Journal of Machine Learning and Cybernetics)
- Text-driven online action detection(Manuel Benavent-Lledo, David Mulero-P'erez, David Ortiz-Pérez, José García Rodríguez, 2025, Integrated Computer-Aided Engineering)
- Temporal Action Detection Model Compression by Progressive Block Drop(Xiaoyong Chen, Yong Guo, Jiaming Liang, Sitong Zhuang, Runhao Zeng, Xiping Hu, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- DyLoRA-TAD: Dynamic Low-Rank Adapter for End-to-End Temporal Action Detection(Jixin Wu, Mingtao Zhou, Di Wu, Wenqi Ren, Jiatian Mei, Shu Zhang, 2025, Computers, Materials & Continua)
- LoSA: Long-Short-Range Adapter for Scaling End-to-End Temporal Action Localization(Akshita Gupta, Gaurav Mittal, Ahmed Magooda, Ye Yu, Graham W. Taylor, Mei Chen, 2024, 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))
- Robust Temporal Action Localization With Meta Boundary Refinement(Jiahua Li, Kun-Juan Wei, Zhe Xu, Liejun Wang, Cheng Deng, 2025, IEEE Transactions on Multimedia)
- Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions(Runhao Zeng, Xiaoyong Chen, Jiaming Liang, Huisi Wu, Guangzhong Cao, Yong Guo, 2024, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Energy vs. Noise: Towards Robust Temporal Action Localization in Open-World(Chenyu Mu, Jiahua Li, Kun-Juan Wei, Cheng Deng, 2025, No journal)
- TAL4Tennis: Temporal Action Localization in Tennis Videos Using State Space Models(Ahmed Jouini, Mohammed Ali Lajnef, Faten Chaieb, A. Loth, 2025, No journal)
- Towards Real-World Power Grid Scenarios: Video Action Detection with Cross-scale Selective Context Aggregation(Lingwen Meng, Siwu Yu, Shasha Luo, Anjun Li, 2025, Inf. Technol. Control.)
- Relative Boundary Modeling: A High-Resolution Cricket Bowl Release Detection Framework with I3D Features(Jun Yu, Leilei Wang, Renjie Lu, Shuoping Yang, Renda Li, Lei Wang, Minchuan Chen, Qingying Zhu, Shaojun Wang, Jing Xiao, 2023, Proceedings of the 6th International Workshop on Multimedia Content Analysis in Sports)
- Opentad: a Unified Framework and Comprehensive Study of Temporal Action Detection(Shuming Liu, Chen Zhao, Fatimah Zohra, Mattia Soldan, A. Pardo, Mengmeng Xu, Lama Alssum, Merey Ramazanova, Juan Le'on Alc'azar, A. Cioppa, Silvio Giancola, Carlos Hinojosa, Bernard Ghanem, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW))
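The online-detection setting above differs from offline TAL in that only past context is available at each instant. The sketch below is a simplified recurrent baseline for that setting (a bounded buffer of past snippet features plus a GRU); the feature dimension, buffer length, and class count are assumptions, and the cited OAD papers use stronger temporal models and training objectives than this.

```python
from collections import deque
import torch
import torch.nn as nn

class OnlineActionDetector(nn.Module):
    """Sketch of streaming online action detection (OAD).

    Keeps a fixed-length buffer of past snippet features and, at every new
    snippet, predicts the class of the current instant from past context only,
    since no future frames exist in the online setting.
    """
    def __init__(self, feat_dim=2048, hidden=512, num_classes=21, buffer_len=64):
        super().__init__()
        self.buffer = deque(maxlen=buffer_len)
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        self.cls = nn.Linear(hidden, num_classes)   # class 0 = background

    @torch.no_grad()
    def step(self, snippet_feat: torch.Tensor):
        # snippet_feat: (feat_dim,) feature of the newly arrived snippet
        self.buffer.append(snippet_feat)
        seq = torch.stack(list(self.buffer)).unsqueeze(0)   # (1, L, D)
        _, h = self.gru(seq)
        return self.cls(h[-1]).softmax(dim=-1)              # (1, num_classes)

det = OnlineActionDetector()
for _ in range(5):                        # simulate a live stream of snippets
    probs = det.step(torch.randn(2048))
print(probs.argmax(dim=-1))
```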
This report consolidates six core research directions in Temporal Action Localization (TAL). The overall trends are: backbone architectures are moving from convolutional networks to Transformers and Mamba (SSM) models capable of handling long temporal dependencies; supervision paradigms are evolving from full supervision, which relies heavily on frame-level annotation, toward weak supervision, point-level supervision, and open-set / zero-shot learning to ease the annotation bottleneck; the algorithmic core still centers on boundary refinement to improve localization accuracy; and the research scope has expanded from laboratory benchmarks to real-time online detection, multimodal fusion, and diverse industrial applications (e.g., power grids, healthcare, sports), with growing attention to computational efficiency and robustness in complex environments.
A total of 216 related papers.
Temporal action localization is an important yet challenging task in video understanding. Typically, such a task aims at inferring both the action category and the localization of the start and end frames for each action instance in a long, untrimmed video. While most current models achieve good results by using pre-defined anchors and numerous actionness scores, such methods are burdened with both a large number of outputs and heavy tuning of the locations and sizes corresponding to different anchors. In contrast, anchor-free methods are lighter, getting rid of redundant hyper-parameters, but have received little attention. In this paper, we propose the first purely anchor-free temporal localization method, which is both efficient and effective. Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module to gather more valuable boundary features for each proposal with a novel boundary pooling, and (iii) several consistency constraints to make sure our model can find the accurate boundary given arbitrary proposals. Extensive experiments show that our method beats all anchor-based and actionness-guided methods with a remarkable margin on THUMOS14, achieving state-of-the-art results, and comparable ones on ActivityNet v1.3. Code is available at https://github.com/TencentYoutuResearch/ActionDetection-AFSD.
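As a rough illustration of the anchor-free decoding described in this abstract, the sketch below turns per-snippet distance-to-boundary regressions and class scores directly into proposals, with no pre-defined anchors. The score threshold, tensor shapes, and random inputs are assumptions for demonstration and do not reproduce AFSD's actual head or refinement module.

```python
import torch

def decode_anchor_free(locations, start_offsets, end_offsets, cls_scores, score_thresh=0.3):
    """Decode per-snippet anchor-free predictions into action proposals.

    Each temporal location t regresses its distances to the action start and end,
    so a proposal is simply [t - d_start, t + d_end] with the location's class
    score; no anchor locations or sizes need to be tuned.
    """
    scores, labels = cls_scores.max(dim=-1)
    keep = scores > score_thresh
    starts = locations[keep] - start_offsets[keep]
    ends = locations[keep] + end_offsets[keep]
    return torch.stack([starts, ends], dim=-1), scores[keep], labels[keep]

T = 100
locations = torch.arange(T, dtype=torch.float32)   # snippet timestamps
start_off = torch.rand(T) * 5                       # predicted distance to action start
end_off = torch.rand(T) * 5                         # predicted distance to action end
cls_scores = torch.rand(T, 20)                      # per-snippet class scores
segs, scores, labels = decode_anchor_free(locations, start_off, end_off, cls_scores)
print(segs.shape, scores.shape)
```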
Pseudo-label learning methods have been widely applied in weakly-supervised temporal action localization. Existing works directly utilize a weakly-supervised base model to generate instance-level pseudo-labels for training the fully-supervised detection head. We argue that the noise in pseudo-labels would interfere with the learning of the fully-supervised detection head, leading to significant performance degradation. Issues with noisy labels include: (1) inaccurate boundary localization; (2) undetected short action clips; (3) multiple adjacent segments incorrectly detected as one segment. To target these issues, we introduce a two-stage noisy-label learning strategy to harness every potentially useful signal in noisy labels. First, we propose a frame-level pseudo-label generation model with a context-aware denoising algorithm to refine the boundaries. Second, we introduce an online-revised teacher-student framework with a missing instance compensation module and an ambiguous instance correction module to solve the short-action-missing and many-to-one problems. Besides, we apply a high-quality pseudo-label mining loss in our online-revised teacher-student framework to add different weights to the noisy labels to train more effectively. Our model greatly outperforms the previous state-of-the-art method in both detection accuracy and inference speed on the THUMOS14 and ActivityNet v1.2 benchmarks.
Weakly-supervised Temporal Action Localization (WTAL) has achieved notable success but still suffers from a lack of temporal annotations, leading to a performance and framework gap compared with fully-supervised methods. While recent approaches employ pseudo labels for training, three key challenges remain unresolved: generating high-quality pseudo labels, making full use of different priors, and optimizing training methods with noisy labels. From these perspectives, we propose PseudoFormer, a novel two-branch framework that bridges the gap between weakly and fully-supervised Temporal Action Localization (TAL). We first introduce RickerFusion, which maps all predicted action proposals to a global shared space to generate pseudo labels with better quality. Subsequently, we leverage both snippet-level and proposal-level labels with different priors from the weak branch to train the regression-based model in the full branch. Finally, the uncertainty mask and iterative refinement mechanism are applied for training with noisy pseudo labels. PseudoFormer achieves state-of-the-art WTAL results on the two commonly used benchmarks, THUMOS14 and ActivityNet1.3. Besides, extensive ablation studies demonstrate the contribution of each component of our method.
Weakly supervised temporal action localization (WTAL) aims to localize action instances with only video-level labels for supervision. Recent methods convert category labels to natural language through prompting and utilize pre-trained vision-language models to generate text representation from natural language for supervision. This is because natural language can provide more prosperous and generalized semantic supervision to compensate for the lack of supervision in weakly supervised scenarios. However, it should be noted that current prompting methods face limitations in generating dynamic prompts that adapt to each video, which leads to difficulties in accurately aligning text and video representations. In this work, we propose a novel Text-Video Knowledge Guided Prompting (TVKP) framework for WTAL, which generates video-aware prompts based on text-video knowledge to enhance semantic alignment between text and video representations and introduce more video-related external category labels to enrich semantic supervision. We introduce the video-aware prompting (VAP) module to learn text-video knowledge from the joint distribution of text and video representations to generate video-aware text representation. Meanwhile, to make VAP more effectively learn text-video knowledge, a text-video contrastive loss is proposed to ensure semantic consistency between text and video representations. In addition, we propose the external knowledge prompting (EKP) module to introduce more video-related text labels from an external knowledge base to enrich prompts for accurate semantic alignment. Extensive experiments are conducted on three public datasets, THUMOS14, ActivityNet1.2, and ActivityNet1.3, demonstrating that our approach outperforms state-of-the-art methods.
Point-supervised Temporal Action Localization (PS-TAL) detects temporal intervals of actions in untrimmed videos with a label-efficient paradigm. However, most existing methods fail to learn action completeness without instance-level annotations, resulting in fragmentary region predictions. In fact, the semantic information of snippets is crucial for detecting complete actions, meaning that snippets with similar representations should be considered as the same action category. To address this issue, we propose a novel representation refinement framework with a semantic query mechanism to enhance the discriminability of snippet-level features. Concretely, we set a group of learnable queries, each representing a specific action category, and dynamically update them based on the video context. With the assistance of these queries, we expect to search for the optimal action sequence that agrees with their semantics. Besides, we leverage some reliable proposals as pseudo labels and design a refinement and completeness module to refine temporal boundaries further, so that the completeness of action instances is captured. Finally, we demonstrate the superiority of the proposed method over existing state-of-the-art approaches on THUMOS14 and ActivityNet13 benchmarks. Notably, thanks to completeness learning, our algorithm achieves significant improvements under more stringent evaluation metrics.
Weakly-Supervised Temporal Action Localization (WTAL) aims to identify the temporal boundaries and classify actions in untrimmed videos using only video-level labels during training. Despite recent progress, many existing approaches primarily follow a localization-by-classification pipeline, treating snippets as independent instances and thus exploiting only limited contextual information. Besides, these methods struggle to capture multi-scale temporal information and neglect both the internal temporal structures within videos and the semantic consistency between videos, resulting in misclassification and inaccurate localization. To address these limitations, we introduce a novel Temporal and Semantic Correlation Network (TSC-Net) for WTAL task, which can be trained end-to-end. First, we propose a Multi-Scale Features Integration Pyramid (MFIP) module to integrate multi-scale temporal features, effectively addressing the challenge of missed detections caused by short action durations. Furthermore, we design a Temporal Correlation Enhancement (TCE) branch to enhance segment correlations by video-level temporal structures to improve the completeness of action localization. Finally, a Dataset-Wide Semantic Awareness (DSA) branch is designed to construct and propagate a dataset-level action semantics bank, enhancing the model’s awareness of semantic consistency in actions. Extensive experiments show that TSC-Net outperforms most existing WTAL methods, achieving an average mAP of 46.3% on the THUMOS-14 dataset and 26.5% on the ActivityNet1.2 dataset. Detailed ablation studies further confirm the effectiveness of each component in our model. The code and models are publicly available at https://github.com/linkang-els/TSC-Net-main.
Weakly supervised temporal action localization (WTAL) aims to precisely locate action instances in given videos by video-level classification supervision, which is partly related to action classification. Most existing localization works directly utilize feature encoders pre-trained for video classification tasks to extract video features, resulting in non-targeted features that lead to incomplete or over-complete action localization. Therefore, we propose the Generalized Contrast Learning Network (GCLNet), in which two novel strategies are proposed to improve the pre-trained features. First, to address the issue of over-completeness, GCLNet introduces text information with good context independence and category separability to enrich the expression of video features, as well as proposes a novel generalized contrastive learning approach for similarity metrics, which facilitates pulling closer the features belonging to the same category while pushing farther apart those from different categories. Consequently, it enables more compact intra-class feature learning and ensures accurate action localization. Second, to tackle the problem of incompleteness, we exploit the respective advantages of RGB and Flow features in scene appearance and temporal motion expression, designing a hybrid attention strategy in GCLNet to mutually enhance the features of each channel. This process greatly improves the features by establishing cross-channel consensus. Finally, we conduct extensive experiments on THUMOS14 and ActivityNet1.2, respectively, and the results show that our proposed GCLNet can produce more representative action localization features.
Point-supervised Temporal Action Localization poses significant challenges due to the difficulty of identifying complete actions with a single-point annotation per action. Existing methods typically employ Multiple Instance Learning, which struggles to capture global temporal context and requires heuristic post-processing. In research on fully-supervised tasks, DETR-based structures have effectively addressed these limitations. However, it is nontrivial to merely adapt DETR to this task, encountering two major bottlenecks: (1) how to integrate point label information into the model, and (2) how to select optimal decoder proposals for training in the absence of complete action segment annotations. To address these issues, we introduce an end-to-end framework by integrating Query Reformation and Optimal Transport (QROT). Specifically, we encode point labels through a set of semantic consensus queries, enabling effective focus on action-relevant snippets. Furthermore, we integrate an optimal transport mechanism to generate high-quality pseudo labels. These pseudo-labels facilitate precise proposal selection based on the Hungarian algorithm, significantly enhancing localization accuracy in point-supervised settings. Extensive experiments on the THUMOS14 and ActivityNet-v1.3 datasets demonstrate that our method outperforms existing MIL-based approaches, offering more stable and accurate temporal action localization under point-level supervision.
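To make the DETR-style matching mentioned in this abstract more tangible, the sketch below performs one-to-one assignment of decoder proposals to pseudo-label segments with SciPy's Hungarian solver. The cost weighting and the toy segments are illustrative assumptions, not the paper's exact recipe (which builds the pseudo labels via optimal transport).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries_to_pseudo_labels(pred_segs, pred_scores, pseudo_segs):
    """One-to-one assignment of decoder proposals to pseudo-label segments.

    The cost combines L1 distance between predicted and pseudo segment
    boundaries with (negative) classification confidence; the Hungarian
    algorithm then selects which proposals receive supervision.
    """
    l1 = np.abs(pred_segs[:, None, :] - pseudo_segs[None, :, :]).sum(-1)  # (Q, G)
    cost = l1 - 2.0 * pred_scores[:, None]                                 # confident + close = cheap
    q_idx, g_idx = linear_sum_assignment(cost)
    return list(zip(q_idx.tolist(), g_idx.tolist()))

preds = np.array([[0.10, 0.30], [0.55, 0.80], [0.40, 0.45]])   # normalized (start, end)
scores = np.array([0.9, 0.7, 0.2])
pseudo = np.array([[0.12, 0.28], [0.50, 0.85]])
print(match_queries_to_pseudo_labels(preds, scores, pseudo))
```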
Temporal Action Localization (TAL) aims to detect the start and end timestamps of actions in a video. However, the training of TAL models requires a substantial amount of manually annotated data. Data programming is an efficient method to create training labels with a series of human-defined labeling functions. However, its application in TAL faces difficulties of defining complex actions in the context of temporal video frames. In this paper, we propose ProTAL, a drag-and-link video programming framework for TAL. ProTAL enables users to define key events by dragging nodes representing body parts and objects and linking them to constrain the relations (direction, distance, etc.). These definitions are used to generate action labels for large-scale unlabelled videos. A semi-supervised method is then employed to train TAL models with such labels. We demonstrate the effectiveness of ProTAL through a usage scenario and a user study, providing insights into designing video programming framework.
Weakly supervised temporal action localization aims to learn to locate actions in videos from video-level or point-level labels, avoiding the need for costly frame-level annotations. Unlike previous work that relies solely on visual modality information, we propose incorporating audio information into the weakly supervised temporal action localization task. While audio-visual localization tasks combine audio and visual information for video localization, temporal action localization often deals with action categories that have weak audio cues. To address this, we propose EAV-Mamba, the first audio-visual perception modeling method based on Mamba. Leveraging Mamba’s powerful audio-visual perception capabilities, we developed modules such as Audio-Perceptive Flow Enhancement, Audio-Perceptive RGB Enhancement, and Audio Self-Perceptive Enhancement. Extensive experiments on two publicly available temporal action localization datasets demonstrate that EAV-Mamba achieves efficient audio-visual perception modeling and state-of-the-art performance in weakly supervised temporal action localization tasks.
Temporal Action Localization (TAL) aims to classify and localize all actions within untrimmed videos. Existing TAL methods often struggle with inaccurate boundary predictions due to the similarity of action content and the uncertainty of boundaries between adjacent frames. Many of these methods rely on fixed or global proposal learning strategies, which lack a more refined method to improve localization accuracy. In this paper, we propose BRTAL, a new Boundary Refinement framework for TAL based on an offset-driven diffusion model, specifically designed to enhance action boundary precision through a refined approach iteratively. Unlike traditional TAL methods emphasizing global target predictions, BRTAL adopts a local refinement perspective by leveraging an offset-driven strategy. Specifically, our framework employs diffusion to iteratively generate local offsets between predictions and ground truth, gradually reducing these offsets to achieve better alignment with the ground truth. This refined approach is particularly effective in addressing the challenges of ambiguous boundaries frequently encountered in TAL, enabling BRTAL to achieve more refined boundary localization than existing methods. Furthermore, we introduce a lightweight yet powerful Temporal Context Modeling (TCM) module to enhance temporal information modeling for accurate action localization. TCM features a Temporal Representation Perception (TRP) layer, which captures temporal evolution and long-term contextual dependencies through a squeeze-and-excitation design combined with large convolutional kernels, ensuring robust temporal representation learning. Extensive experiments on THUMOS14, ActivityNet-1.3, and EPIC-KITCHEN 100 datasets highlight the significant advantages of BRTAL. Notably, BRTAL achieves an average mAP of 69.6% on THUMOS14, establishing a new state-of-the-art benchmark and demonstrating its outstanding boundary refinement capability.
Weakly-supervised fine-grained temporal action localization seeks to identify fine-grained action instances in untrimmed videos using only video-level labels. The primary challenge in this task arises from the subtle distinctions among various fine-grained action categories, which complicate the accurate localization of specific action instances. In this paper, we note that the context information embedded within the videos plays a crucial role in overcoming this challenge. However, we also find that effectively integrating context information across different scales is non-trivial, as not all scales provide equally valuable information for distinguishing fine-grained actions. Based on these observations, we propose a weakly-supervised fine-grained temporal action localization approach termed the Context Sensitive Network, which aims to fully leverage context information. Specifically, we first introduce a multi-scale context extraction module designed to efficiently capture multi-scale temporal contexts. Subsequently, we develop a scale-sensitive context gating module that facilitates interaction among multi-scale contexts and adaptively selects informative contexts based on varying video content. Extensive experiments conducted on two benchmark datasets, FineGym and FineAction, demonstrate that our approach achieves state-of-the-art performance.
Temporal Action Localization (TAL) aims to identify the boundaries of actions and their corresponding categories in untrimmed videos. Most existing methods simultaneously process past and future information, neglecting the inherently sequential nature of action occurrence. This confused treatment of past and future information hinders the model’s ability to understand action procedures effectively. To address these issues, we propose Bidirectional Feature Splitting with Cross-Layer Fusion for Temporal Action Localization (BFSTAL), a new bidirectional feature-splitting approach based on Mamba for the TAL task, composed of two core parts, Decomposed Bidirectionally Hybrid (DBH) and Cross-Layer Fusion Detection (CLFD), which explicitly enhances the model’s capacity to understand action procedures, especially to localize temporal boundaries of actions. Specifically, we introduce the Decomposed Bidirectionally Hybrid (DBH) component, which splits video features at a given timestamp into forward features (past information) and backward features (future information). DBH integrates three key modules: Bidirectional Multi-Head Self-Attention (Bi-MHSA), Bidirectional State Space Model (Bi-SSM), and Bidirectional Convolution (Bi-CONV). DBH effectively captures long-range dependencies by combining state-space modeling, attention mechanisms, and convolutional networks while improving spatial-temporal awareness. Furthermore, we propose Cross-Layer Fusion Detection (CLFD), which aggregates multi-scale features from different pyramid levels, enhancing contextual understanding and temporal action localization precision. Extensive experiments demonstrate that BFSTAL outperforms other methods on four widely used TAL benchmarks: THUMOS14, EPIC-KITCHENS 100, Charades, and MultiTHUMOS.
Temporal Action Localization (TAL) aims to accurately identify the start and end times of actions in untrimmed videos and classify them according to specific labels. However, the complexity and imbalance between target actions and background in video data make this task particularly challenging. Although relying on large amounts of finely annotated data has led to some progress in existing methods, the presence of noisy labels in large-scale annotations limits their application in open-world scenarios. To address this issue, we take the perspective of the data itself, modeling the different energy patterns exhibited by the action foreground and background in video data to enhance video content inference. Specifically, we propose the Energy-Driven Meta Purifier (EDMP) method, which utilizes a meta-learning training paradigm to avoid dependence on extensive and precise manual annotations. Under this pipeline, we use energy modeling to distinguish between different actions and backgrounds from the perspective of energy differences, thereby improving the model's robustness to category noise. Additionally, these energy-based distinctions are employed to further refine action boundaries, enhancing the model's robustness to boundary noise. Experiments on THUMOS14 and ActivityNet1.3 datasets show that EDMP effectively enhances the robustness of TAL models.
Temporal action localization aims to identify the boundaries of the action of interest in a video. Most existing methods take a two-stage approach: first, identify a set of action proposals; then, based on this set, determine the accurate temporal locations of the action of interest. However, the diversely distributed semantics of a video over time have not been well considered, which could compromise the localization performance, especially for ubiquitous short actions or events (e.g., a fall in healthcare and a traffic violation in surveillance). To address this problem, we propose a novel deep learning architecture, namely an adaptive template-guided self-attention network, to characterize the proposals adaptively with their relevant frames. An input video is segmented into temporal frames, within which the spatio-temporal patterns are formulated by a global–Local Transformer-based encoder. Each frame is associated with a number of proposals of different lengths as their starting frame. Learnable templates for proposals of different lengths are introduced, and each template guides the sampling for proposals with a specific length. It formulates the probabilities for a proposal to form the representation of certain spatio-temporal patterns from its relevant temporal frames. Therefore, the semantics of a proposal can be formulated in an adaptive manner, and a feature map of all proposals can be appropriately characterized. To estimate the IoU of these proposals with ground truth actions, a two-level scheme is introduced. A shortcut connection is also utilized to refine the predictions by using the convolutions of the feature map from coarse to fine. Comprehensive experiments on two benchmark datasets demonstrate the state-of-the-art performance of our proposed method: 32.6% mAP@IoU 0.7 on THUMOS-14 and 9.35% mAP@IoU 0.95 on ActivityNet-1.3.
Weakly-supervised temporal action localization (WS-TAL) aims to localize actions in untrimmed videos with only video-level labels. Most existing models follow the "localization by classification" procedure: locate temporal regions contributing most to the video-level classification. Generally, they process each snippet (or frame) individually and thus overlook the fruitful temporal context relation. Here arises the single snippet cheating issue: "hard" snippets are too vague to be classified. In this paper, we argue that learning by comparing helps identify these hard snippets and we propose to utilize snippet Contrastive learning to Localize Actions, CoLA for short. Specifically, we propose a Snippet Contrast (SniCo) Loss to refine the hard snippet representation in feature space, which guides the network to perceive precise temporal boundaries and avoid the temporal interval interruption. Besides, since it is infeasible to access frame-level annotations, we introduce a Hard Snippet Mining algorithm to locate the potential hard snippets. Substantial analyses verify that this mining strategy efficaciously captures the hard snippets and SniCo Loss leads to more informative feature representation. Extensive experiments show that CoLA achieves state-of-the-art results on THUMOS’14 and ActivityNet v1.2 datasets.
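The following is a minimal InfoNCE-style sketch in the spirit of the snippet contrast described in this abstract: hard (ambiguous) snippets are pulled toward confident action snippets and pushed away from confident background snippets. The mining of easy/hard snippets from actionness scores is assumed to happen beforehand, and the mean-over-positives form and temperature here are simplifying assumptions rather than CoLA's exact loss.

```python
import torch
import torch.nn.functional as F

def snippet_contrastive_loss(easy_act, hard_act, easy_bkg, temperature=0.07):
    """Contrast hard action snippets against easy action/background snippets.

    easy_act: (Np, D) confident action snippet features (positives)
    hard_act: (Nh, D) ambiguous snippet features (queries to refine)
    easy_bkg: (Nn, D) confident background snippet features (negatives)
    """
    q = F.normalize(hard_act, dim=-1)
    k_pos = F.normalize(easy_act, dim=-1)
    k_neg = F.normalize(easy_bkg, dim=-1)
    pos = (q @ k_pos.T).mean(dim=1, keepdim=True) / temperature   # (Nh, 1)
    neg = (q @ k_neg.T) / temperature                              # (Nh, Nn)
    logits = torch.cat([pos, neg], dim=1)
    labels = torch.zeros(len(q), dtype=torch.long)                 # positive sits at index 0
    return F.cross_entropy(logits, labels)

loss = snippet_contrastive_loss(torch.randn(8, 256), torch.randn(4, 256), torch.randn(16, 256))
print(loss.item())
```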
Weakly-supervised temporal action localization (WTAL) aims to localize and classify action instances in untrimmed videos with only video-level labels available. Despite the remarkable success of existing methods, whose generated proposals commonly far outnumber the ground-truth action instances, it still makes sense to improve the ranking accuracy of the generated proposals since users in real-world scenarios usually prioritize the action proposals with the highest confidence scores. The inaccuracy of the proposal ranking mainly comes from two aspects: For one thing, the traditional proposal generation manner entirely relies on snippet-level perception, resulting in a significant yet unnoticed gap with the target of proposal-level localization. For another, existing methods commonly employ a hand-crafted proposal generation manner, a post-process that does not participate in model optimization. To address the above issues, we propose an end-to-end trained two-stage method, termed as Learning Proposal-aware Re-ranking (LPR) for WTAL. In the first stage, we design a proposal-aware feature learning module to inject the proposal-aware contextual information into each snippet, and then the enhanced features are utilized for predicting initial proposals. Furthermore, to perform effective and efficient proposal re-ranking, in the second stage, we contrast the proposals attached with high confidence scores with our constructed multi-scale foreground/background prototypes for further optimization. Evaluated by both the vanilla and Top-$k$ mAP metrics, results of extensive experiments on two popular benchmarks demonstrate the effectiveness of our proposed method.
Most modern approaches in temporal action localization divide this problem into two parts: (i) short-term feature extraction and (ii) long-range temporal boundary localization. Due to the high GPU memory cost caused by processing long untrimmed videos, many methods sacrifice the representational power of the short-term feature extractor by either freezing the backbone or using a small spatial video resolution. This issue becomes even worse with the recent video transformer models, many of which have quadratic memory complexity. To address these issues, we propose TALLFormer, a memory-efficient and end-to-end trainable Temporal Action Localization Transformer with Long-term memory. Our long-term memory mechanism eliminates the need for processing hundreds of redundant video frames during each training iteration, thus significantly reducing the GPU memory consumption and training time. These efficiency savings allow us (i) to use a powerful video transformer feature extractor without freezing the backbone or reducing the spatial video resolution, while (ii) also maintaining long-range temporal boundary localization capability. With only RGB frames as input and no external action recognition classifier, TALLFormer outperforms previous state-of-the-arts by a large margin, achieving an average mAP of 59.1% on THUMOS14 and 35.6% on ActivityNet-1.3. The code is publicly available: https://github.com/klauscc/TALLFormer.
The weakly-supervised temporal action localization task is to identify action categories and their start and end times in untrimmed videos. How to achieve feature calibration between different modalities in this task, and how to further optimize action boundaries based on the similarity of common action sequences, remains an urgent problem to be solved. Based on the above issues, we propose a novel network framework, weakly supervised temporal action localization via feature calibration-assisted sequence comparison (FCSC). The core of the FCSC framework lies in the Multi-Modal Feature Calibration Module (MFCM), which utilizes global and local contextual information from the primary and auxiliary modalities to enhance RGB and FLOW features, respectively, achieving deep feature calibration. In addition, the framework introduces an improved distinguishable edit distance metric for sequence similarity optimization (SSO) and maximum consistent subsequence (MCS) extraction to narrow the gap between classification and localization tasks. Extensive experiments show that FCSC achieves mAPs of 47.7% and 27.9% on the THUMOS14 and ActivityNet1.2 temporal action localization benchmark test sets, respectively, fully verifying the effectiveness of the model.
Temporal action localization is a fundamental task in video understanding that focuses on classifying and temporally localizing action instances in untrimmed videos. Compared to temporal action localization, the Weakly supervised Temporal Action Localization (WTAL) task presents greater challenges, as its training data lacks detailed information about action boundaries. Existing WTAL methods ignore the complementary relationship between modalities and the dependency between snippets, resulting in inaccurate localization results. To solve these issues, we propose a Collaborative Hierarchical Aggregation Network (CHA-Net). Specifically, we first use a modality complementary module to learn the synergies between modalities. Then, a collaborative enhance module is proposed to remove the information irrelevant to actions in RGB modality. Finally, a hierarchical aggregation module is proposed to capture the complete temporal information of action instances to better mine the temporal dependencies between snippets. Extensive experiments on THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of our method. Compared with F3-Net (TMM2024, Avg{0.1:0.5}) and SPCC-Net (TMM2024, Avg{0.1:0.7}) on the THUMOS14 dataset, the proposed method can achieve improvements of 3.2% and 2.4%, respectively.
Most popular Feature Pyramid Networks (FPN) for temporal action localization (TAL) in videos encode multi-scale features during downsampling, which inevitably brings fine-grained feature loss. In addition, most popular TAL models directly apply self-attention mechanisms that impose equal importance on consecutive frames, which might lead to feature homogenization. To address these problems, we propose a Dual-Spiral Pyramid Network with Multi-scale Attention (DSPA), which consists of three main modules: a Feature Enhancement Module (FEM), a Dual-Spiral Feature Pyramid Network (Ds-FPN), and a Multi-Scale Dual-Spiral Attention Convolution Module (Ds-MAC). To be specific, we use the FEM to enhance features by exploring the relations along different temporal and channel dimensions. Moreover, the Ds-FPN integrates high-resolution temporal features from the base layer with fine-grained features processed by the FEM and sequentially propagates these fused features across adjacent layers to construct a hierarchical multi-scale video representation. Furthermore, the Ds-MAC adopts a hierarchical architecture with long-term and short-term temporal modeling and residual learning to capture global context and fine details, while enhancing feature diversity and reducing convergence risk through advanced nonlinear transformations. Extensive experimental results demonstrate that our approach achieves state-of-the-art performance on two public datasets, THUMOS14 and EPIC-Kitchens 100.
Deep learning models need to encode both local and global temporal dependencies for accurate temporal action localization (TAL). Recent approaches have relied on Transformer blocks, which have quadratic complexity. By contrast, Mamba blocks have been adapted for TAL due to their comparable performance and lower complexity. However, various factors can influence the choice between these models, and a thorough analysis of them can provide valuable insights into the selection process. In this work, we analyze the Transformer block, Mamba block, and their combinations as temporal feature encoders for TAL, measuring their overall performance, efficiency, and sensitivity across different contexts. Our analysis suggests that Mamba blocks should be preferred due to their performance and efficiency. Hybrid encoders can serve as an alternative choice when sufficient computational resources are available.
Weakly-supervised temporal action localization (WTAL) aims to identify and localize action instances in untrimmed videos using only video-level labels. Existing methods typically rely on original features from frozen pre-trained encoders designed for trimmed action classification (TAC) tasks, which inevitably introduces task discrepancy. Additionally, these methods often overlook the importance of considering action consistency from multiple perspectives, specifically the consistency in action processes and action semantics, both of which are crucial for the model's understanding of actions. To address these issues, we propose a novel WTAL method based on similar modality enhancement and action consistency learning (SEAL). First, we construct global descriptors for each action category, and use the pseudo-labels generated based on these descriptors to guide the model in learning more consistent representations, thereby mitigating task discrepancy. Second, we design two types of losses to achieve action consistency learning: process consistency loss, which penalizes candidate proposals that deviate from the action center to ensure the completeness of the action process, and semantic consistency loss, which employs local descriptors to help proposals of the same action category (especially those with apparent semantic confusion) learn similar feature distributions. Extensive experiments on the THUMOS14 and ActivityNet datasets demonstrate the superior performance of the proposed method compared to state-of-the-art methods.
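The descriptor-and-pseudo-label idea above can be illustrated as follows: build one global descriptor per class from the most confident snippets and derive pseudo-labels from cosine similarity to those descriptors. Shapes, the top-k rule, and the threshold are illustrative assumptions, not SEAL's exact procedure.

```python
# Illustrative sketch of descriptor-guided pseudo-labels, assuming one global
# descriptor per action class obtained by averaging high-confidence snippet
# features; thresholds and shapes are made up for the example.
import torch
import torch.nn.functional as F

def build_descriptors(features, class_probs, top_k=5):
    # features: (T, D) snippet features; class_probs: (T, C) snippet scores.
    descriptors = []
    for c in range(class_probs.shape[1]):
        idx = class_probs[:, c].topk(top_k).indices   # most confident snippets
        descriptors.append(features[idx].mean(dim=0))
    return torch.stack(descriptors)                   # (C, D)

def descriptor_pseudo_labels(features, descriptors, threshold=0.7):
    # Cosine similarity between each snippet and each class descriptor.
    sim = F.cosine_similarity(features.unsqueeze(1), descriptors.unsqueeze(0), dim=-1)
    return (sim > threshold).float()                  # (T, C) binary pseudo-labels

T, C, D = 100, 20, 256
feats, probs = torch.randn(T, D), torch.rand(T, C)
desc = build_descriptors(feats, probs)
pseudo = descriptor_pseudo_labels(feats, desc)
print(desc.shape, pseudo.shape)                       # (20, 256) (100, 20)
```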
Weakly supervised temporal action localization (WTAL) targets the joint classification of action categories and precise delineation of their temporal boundaries in untrimmed videos while relying only on video-level labels. The absence of frame-level supervision inevitably causes two key difficulties: (i) incomplete localization of action segments and (ii) confusion between foreground and background frames. To overcome these challenges, we propose the Consensus-Guided Selective Multimodal Fusion Network (CG-SMFNet). First, a Selective Fusion Module (SFM) exploits the complementarity of multimodal cues to distill rich semantic representations. Second, a Consensus Attention Mechanism (CAM) dynamically assigns fusion weights to the three modality branches and enables bidirectional information exchange, ensuring a more holistic capture of action content. Finally, a Discrepant Expansion Mechanism (DEM) introduces a semantic contrast loss that enlarges the distance between foreground segments and semantically similar background regions, further sharpening localization accuracy. Extensive experiments on public benchmarks verify that CG-SMFNet achieves state-of-the-art performance under weak supervision.
Temporal Action Localization (TAL) has garnered significant attention in information retrieval. Existing supervised or weakly supervised methods heavily rely on labeled temporal boundaries and action categories, which are labor-intensive and time-consuming. Consequently, unsupervised temporal action localization (UTAL) has gained popularity. However, current methods face two main challenges: 1) Classification pre-trained features overly focus on highly discriminative regions; 2) Solely relying on visual modality information makes it difficult to determine contextual boundaries. To address these issues, we propose a CLIP-assisted cross-view audiovisual enhanced UTAL method. Specifically, we introduce visual language pre-training (VLP) and classification pre-training-based collaborative enhancement to avoid excessive focus on highly discriminative regions; we also incorporate audio perception to provide richer contextual boundary information. Finally, we introduce a self-supervised cross-view learning paradigm to achieve multi-view perceptual enhancement without additional annotations. Extensive experiments on two public datasets demonstrate our model's superiority over several state-of-the-art competitors.
No abstract available
The training of temporal action localization models relies heavily on a large amount of manually annotated data, and video annotation is more tedious and time-consuming than image annotation. Therefore, semi-supervised methods that combine labeled and unlabeled data for joint training have attracted increasing attention from academia and industry. This study proposes a method called pseudo-label refining (PLR) based on the teacher-student framework, which consists of three key components. First, we propose pseudo-label self-refinement, which features a temporal region-of-interest pooling operation to improve the boundary accuracy of TAL pseudo-labels. Second, we design a boundary synthesis module to further refine the temporal intervals of pseudo-labels through multiple inferences. Finally, an adaptive weight learning strategy is tailored for progressively learning pseudo-labels of different qualities. The proposed method uses ActionFormer and BMN as detectors and achieves significant improvements on the THUMOS14 and ActivityNet v1.3 datasets. The experimental results show that the proposed method significantly improves localization accuracy compared to other advanced SSTAL methods at label rates of 10% to 60%. Further ablation experiments show the effectiveness of each module, proving that the PLR method can improve the accuracy of pseudo-labels obtained by teacher-model inference.
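As a rough illustration of the pooling-based refinement step, the sketch below max-pools snippet actionness inside a candidate interval over a fixed number of temporal bins and keeps the slightly shifted boundaries with the highest pooled response; the bin count, shift range, and scoring rule are invented for the example and are not the paper's design.

```python
# Sketch of temporal region-of-interest pooling over snippet-level actionness
# scores, used here to score and nudge a candidate pseudo-label interval.
import numpy as np

def temporal_roi_pool(scores, start, end, num_bins=8):
    # scores: (T,) per-snippet actionness; [start, end) is a candidate interval.
    edges = np.linspace(start, end, num_bins + 1).astype(int)
    pooled = [scores[edges[b]:max(edges[b] + 1, edges[b + 1])].max()
              for b in range(num_bins)]
    return np.array(pooled)                  # fixed-length summary of the interval

def refine_interval(scores, start, end, shift=2):
    # Keep the boundary pair whose pooled response is highest among small shifts:
    # a crude stand-in for the refinement step.
    best = max(((s, e) for s in range(max(0, start - shift), start + shift + 1)
                        for e in range(end - shift, min(len(scores), end + shift) + 1)
                        if e - s > 1),
               key=lambda se: temporal_roi_pool(scores, *se).mean())
    return best

actionness = np.clip(np.concatenate([np.zeros(20), np.ones(30), np.zeros(50)])
                     + 0.1 * np.random.randn(100), 0, 1)
print(refine_interval(actionness, start=18, end=46))
```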
In recent years, the wide application of weakly supervised temporal action localization (WTAL) technology has improved the efficiency of video analysis. However, this domain continues to confront numerous challenges, especially due to the lack of precise temporal annotations. Consequently, this technique becomes highly susceptible to contextual background noise and overly reliant on prominent action segments, leading to less-than-ideal action localization. To alleviate this problem, we propose the contrastive learning-based action salience network (CLASNet), comprising two pivotal modules: a feature contrast separation module (FCSM) and a boundary refinement module (BRM). FCSM utilizes a contrastive learning approach to effectively separate action features from background features, thereby enhancing the discriminability of features. Concurrently, BRM introduces a boundary refinement loss to rectify the temporal boundaries of actions, further elevating the precision of temporal localization. The collaborative functioning of these two key modules effectively resolves the ambiguity issues in temporal action localization under weak supervision, markedly enhancing localization accuracy. Furthermore, CLASNet is versatile and can be integrated into different WTAL frameworks, achieving enhanced localization performance while preserving the original end-to-end training manner. Using three large-scale benchmark action localization datasets, THUMOS14, ActivityNet v1.2, and ActivityNet v1.3, we embed CLASNet into various cutting-edge weakly supervised temporal action localization methods, such as CO2-Net, DELU, and ACRNet, for empirical substantiation. The experimental outcomes reveal that CLASNet significantly enhances the efficacy of these methods in action localization, offering novel perspectives for the advancement of temporal action localization technology.
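One simple way to realize the action/background contrast described above is an InfoNCE-style loss that pulls the top-scoring (pseudo-action) snippets toward an action prototype and pushes the bottom-scoring (pseudo-background) snippets away; the selection rule and temperature below are placeholders rather than CLASNet's actual formulation.

```python
# Sketch of a contrastive action/background separation loss, assuming the top-
# and bottom-scoring snippets of a video serve as action and background sets.
import torch
import torch.nn.functional as F

def action_background_contrast(features, actionness, k=5, temperature=0.1):
    # features: (T, D) snippet features; actionness: (T,) scores from the base branch.
    feats = F.normalize(features, dim=-1)
    act_idx = actionness.topk(k).indices                  # pseudo action snippets
    bkg_idx = (-actionness).topk(k).indices               # pseudo background snippets
    anchor = F.normalize(feats[act_idx].mean(dim=0, keepdim=True), dim=-1)  # (1, D)
    pos = (feats[act_idx] @ anchor.t()).squeeze(1) / temperature            # (k,)
    neg = (feats[bkg_idx] @ anchor.t()).squeeze(1) / temperature            # (k,)
    # InfoNCE-style: each action snippet competes against all background snippets.
    logits = torch.cat([pos.unsqueeze(1), neg.unsqueeze(0).expand(k, -1)], dim=1)
    return (torch.logsumexp(logits, dim=1) - pos).mean()

loss = action_background_contrast(torch.randn(100, 256), torch.rand(100))
print(float(loss))
```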
Weakly supervised temporal action localization (WTAL) aims to identify action instances in untrimmed videos with only video-level supervision. Despite recent advances in WTAL methods, achieving accurate boundary localization remains a significant challenge. A key reason is that WTAL networks following a localization-by-classification pipeline tend to focus on the most discriminative features, neglecting some ambiguous features that may contain action instances. To make the WTAL model focus on low-discriminative features that include action instances, we propose an action-to-action diffusion (ActionDiff) network. This network leverages the smoothness of data generated by the diffusion model, using the diffusion model to output smooth and high-quality features that weaken the discriminative action features from the base branch, thereby enhancing the performance of the WTAL task. First, we develop a topk-based masking strategy to generate binary masks that serve as pseudo-labels for diffusion model learning. Then, we propose a diffusion branch to generate high-quality latent action space by iteratively removing noise guided by the designed pseudo-labels and conditional information. To enhance the diffusion branch’s capability to generate human behavioral features, we design an action-related conditional strategy to obtain conditional information and use it to guide the modeling of human behavior knowledge by the diffusion branch. Our comprehensive experiments demonstrate that the proposed method achieves a promising performance on three benchmark datasets: THUMOS14, ActivityNet v1.2, and v1.3.
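The top-k masking step can be sketched in a few lines: given the class activation sequence and the video-level label, mark the k highest-scoring snippets as the binary pseudo-label mask. The ratio used for k and the single-label assumption are illustrative.

```python
# Minimal sketch of a top-k masking step that turns a class activation sequence
# into a binary pseudo-label mask; k and the single video-level class are
# illustrative assumptions, not the paper's exact configuration.
import torch

def topk_binary_mask(cas, video_label, ratio=0.125):
    # cas: (T, C) class activation sequence; video_label: index of the
    # video-level ground-truth class.
    scores = cas[:, video_label]                     # (T,)
    k = max(1, int(ratio * cas.shape[0]))
    mask = torch.zeros_like(scores)
    mask[scores.topk(k).indices] = 1.0               # 1 = likely action snippet
    return mask

cas = torch.rand(160, 20)
mask = topk_binary_mask(cas, video_label=3)
print(mask.sum().item(), mask.shape)                 # 20.0 torch.Size([160])
```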
Temporal Action Localization (TAL) aims to localize the start and end timestamps of actions with specific categories in untrimmed videos. Despite great success, noisy action boundary labels may be included due to the inherent subjectivity of manual annotations. This can lead TAL models to learn inaccurate action boundaries during training, potentially impairing their localization performance. To systematically analyze and enhance the TAL models’ robustness against noisy action boundary labels, we introduce a new task termed TAL with Noisy Label. We demonstrate that introducing even minimal random noise to action boundary labels in training data can substantially degrade the performance of leading TAL methods, thereby underscoring their vulnerability to noisy action boundary labels. To be specific, we propose a novel plug-and-play method called Energy-based Meta Boundary Refinement (EMBR), where a meta-learning pipeline is employed to rectify noisy action boundary labels, ameliorating the misguidance of noisy labels on model training. Under this meta-learning pipeline, EMBR utilizes an energy function to calculate the magnitude of label noise and re-weights samples, assigning lower weights to samples with higher noise, alleviating the impact of noisy samples on model training. In addition, considering the energy difference between action and background segments, an energy-based loss function is proposed to achieve larger energy differences across the boundary, assisting in the boundary refinement. Experimental results on the THUMOS14, ActivityNet1.3, and HACS datasets demonstrate the effectiveness of EMBR in enhancing the robustness of TAL models.
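A minimal sketch of the re-weighting idea follows, assuming a per-sample "energy" (estimated noise magnitude) has already been computed from the model's outputs: samples with higher energy receive lower weight in the boundary regression loss.

```python
# Sketch of energy-based sample re-weighting: samples whose estimated label
# noise (the "energy") is high get a smaller weight in the boundary loss.
# How the energy is computed from model outputs is abstracted away here.
import torch

def energy_weights(energies, temperature=1.0):
    # energies: (N,) estimated noise magnitude per training sample.
    # Lower energy -> larger weight; weights sum to N so the loss scale is kept.
    w = torch.softmax(-energies / temperature, dim=0)
    return w * len(energies)

def weighted_boundary_loss(pred, target, energies):
    per_sample = (pred - target).abs().mean(dim=-1)      # (N,) L1 on (start, end)
    return (energy_weights(energies) * per_sample).mean()

pred = torch.tensor([[10.0, 42.0], [5.0, 20.0], [63.0, 90.0]])
noisy_target = torch.tensor([[11.0, 40.0], [5.0, 35.0], [62.0, 91.0]])
energies = torch.tensor([0.2, 2.5, 0.1])                 # second sample looks noisy
print(float(weighted_boundary_loss(pred, noisy_target, energies)))
```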
We present our solution to the BinEgo-360 Challenge at ICCV 2025, which focuses on temporal action localization (TAL) in multi-perspective and multi-modal video settings. The challenge provides a dataset containing panoramic, third-person, and egocentric recordings, annotated with fine-grained action classes. Our approach is built on the Temporal Shift Module (TSM), which we extend to handle TAL by introducing a background class and classifying fixed-length non-overlapping intervals. We employ a multi-task learning framework that jointly optimizes for scene classification and TAL, leveraging contextual cues between actions and environments. Finally, we integrate multiple models through a weighted ensemble strategy, which improves robustness and consistency of predictions. Our method is ranked first in both the initial and extended rounds of the competition, demonstrating the effectiveness of combining multi-task learning, an efficient backbone, and ensemble learning for TAL.
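The interval-classification formulation described above can be sketched as follows: cut the video into fixed-length non-overlapping windows, assign each window an action class or a background class, and merge consecutive identical labels into segments. The window size, class names, and the stub classifier are assumptions for illustration only.

```python
# Sketch of TAL via fixed-length non-overlapping interval classification with an
# added background class; the per-window classifier is a stand-in stub.
import random

BACKGROUND = "background"

def classify_window(frames):
    # Placeholder for the per-window classifier (e.g., a TSM-based model).
    return random.choice(["serve", "rally", BACKGROUND])

def localize(num_frames, window=16, fps=30.0):
    labels = [classify_window(range(s, s + window))
              for s in range(0, num_frames - window + 1, window)]
    segments, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            if labels[start] != BACKGROUND:              # drop background runs
                segments.append((start * window / fps, i * window / fps, labels[start]))
            start = i
    return segments

random.seed(0)
for seg in localize(num_frames=480):
    print(f"{seg[2]}: {seg[0]:.2f}s - {seg[1]:.2f}s")
```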
Temporal action localization (TAL) is a research hotspot in video understanding, which aims to locate and classify actions in videos. However, existing methods have difficulties in capturing long-term actions because they focus on local temporal information, which leads to poor performance in localizing long temporal sequences. In addition, most methods ignore the importance of boundaries for action instances, resulting in inaccurately localized boundaries. To address these issues, this paper proposes a state space model for temporal action localization, called Separated Bidirectional Mamba (SBM), which innovatively understands frame changes from the perspective of state transformation. It adapts to different sequence lengths and incorporates forward and backward state information for each frame through a forward Mamba and a backward Mamba, obtaining more comprehensive action representations and enhancing modeling capabilities for long temporal sequences. Moreover, this paper designs a Boundary Correction Strategy (BCS). It calculates the contribution of each frame to action instances based on the pre-localized results, then adjusts the weights of frames in boundary regression so that the boundaries are shifted towards the frames with higher contributions, leading to more accurate boundaries. To demonstrate the effectiveness of the proposed method, this paper reports mean Average Precision (mAP) under temporal Intersection over Union (tIoU) thresholds on four challenging benchmarks: THUMOS13, ActivityNet-1.3, HACS, and FineAction, where the proposed method achieves mAPs of 73.7%, 42.0%, 45.2%, and 29.1%, respectively, surpassing state-of-the-art approaches.
The purpose of weakly-supervised temporal action localization (WTAL) task is to simultaneously classify and localize action instances in untrimmed videos with only video-level labels. Previous works fail to extract multi-scale temporal features to identify action instances with different durations, and they do not fully use the temporal cues of action video to learn discriminative features. In addition, the classifiers trained by current methods usually focus on easy-to-distinguish snippets while ignoring other semantically ambiguous features, which leads to incomplete and over-complete localization. To address these issues, we introduce a new Snippet-inter Difference Attention Network (SDANet) for WTAL, which can be trained end-to-end. Specifically, our model presents three modules, with primary contributions lying in the snippet-inter difference attention (SDA) module and potential feature mining (PFM) module. Firstly, we construct a simple multi-scale temporal feature fusion (MTFF) module to generate multi-scale temporal feature representation, so as to help the model better detect short action instances. Secondly, we consider the temporal cues of video features and design SDA module based on the Transformer to capture global discriminative features for each modality based on multi-scale features. It calculates the differences between temporal neighbor snippets in each modality to explore salient-difference features, and then utilizes them to guide correlation modeling. Thirdly, after learning discriminative features, we devise PFM module to excavate potential action and background snippets from ambiguous features. By contrastive learning, potential actions are forced closer to discriminative actions and away from the background, thereby learning more accurate action boundaries. Finally, two losses (i.e., similarity loss and reconstruction loss) are further developed to constrain the consistency between two modalities and help the model retain original feature information for better localization results. Extensive experiments show that our model achieves better performance against current WTAL methods on three datasets, i.e., THUMOS14, ActivityNet1.2 and ActivityNet1.3.
Weakly supervised temporal action localization uses video-level labels to locate action segments in untrimmed long videos. It is widely applicable across scenarios, but faces the challenges of feature redundancy and boundary blur. To address the problems of redundant feature modeling and ambiguous boundary localization, this paper adopts a two-stage optimization method and constructs a proposal generation and classification network that integrates a global-local context awareness mechanism with a dynamic boundary optimization strategy.
Temporal action localization is a classic computer vision problem in video understanding with a wide range of applications. In the context of sports videos, it is integrated into most of the current solutions used by coaches, broadcasters and game specialists to assist in performance analysis, strategy development, and enhancing the viewing experience. This work presents an application study on temporal action localization for tennis broadcast videos. We study and evaluate a foundational video understanding model for identifying tennis actions in match footage. We explore its architecture, specifically the state space model, from video input to the prediction of temporal segments and classification labels. Our experiments provide findings and interpretations of the model's performance on tennis data. We achieved an average mean Average Precision (mAP) of 66.14% over all thresholds on the TenniSet dataset, surpassing the other methods, and 96.16% on our private French Open dataset.
In the field of comprehensive video understanding, Temporal Action Localization (TAL) plays a vital role by precisely identifying when actions begin and end in untrimmed video sequences, enabling more accurate analysis of complex temporal dynamics. Despite their importance, existing datasets and algorithms in the sports field face significant challenges. Current datasets independently consider single-person or two-person event categories, overlooking the simultaneous occurrence of multiple actions and interactions in real-world environments. Existing studies primarily concentrate on mainstream sports like soccer, traditional basketball, and volleyball, whereas numerous sporting disciplines such as 3×3 basketball remain underserved with respect to specialized datasets and custom analytical frameworks. To address this issue, we propose a new real-world 3×3 basketball TAL (TAL3×3) dataset and algorithm: the TAL3×3 dataset includes 3 single-person action classes (such as shooting a basketball) and 5 two-person interaction classes (such as passing the ball between players) with 633k human bounding boxes and 99,572 action instances on 106k frames. To benchmark TAL3×3, we develop the TAL3×3 algorithm consisting of two distinct phases: 1) generation of action proposals, and 2) construction of representations for each proposal, followed by classification into specific interaction categories or background. Extensive experiments demonstrate that our method achieves remarkable performance with 60.09% mAP and 78.93% accuracy on our dataset, substantially outperforming existing approaches and establishing new benchmarks for complexity-aware action localization in team sports. We expect TAL3×3 will contribute to temporal action localization and basketball game analytics, while advancing the development of temporal contextual modeling techniques in the field of TAL. The dataset is available at https://github.com/open-starlab/TAL3×3
Inspired by the recent success of transformers and multi-stage architectures in the video recognition and object detection domains, we thoroughly explore the rich spatio-temporal properties of transformers within a multi-stage architecture paradigm for the temporal action localization (TAL) task. This exploration led to the development of a hierarchical multi-stage transformer architecture called PCL-Former, where each subtask is handled by a dedicated transformer module with a specialized loss function. Specifically, the Proposal-Former identifies candidate segments in an untrimmed video that may contain actions, the Classification-Former classifies the action categories within those segments, and the Localization-Former precisely predicts the temporal boundaries (i.e., start and end) of the action instances. To evaluate the performance of our method, we have conducted extensive experiments on three challenging benchmark datasets: THUMOS-14, ActivityNet-1.3, and HACS Segments. We also conducted detailed ablation experiments to assess the impact of each individual module of our PCL-Former. The obtained quantitative results validate the effectiveness of the proposed PCL-Former, outperforming state-of-the-art TAL approaches by 2.8%, 1.2%, and 4.8% on the THUMOS14, ActivityNet-1.3, and HACS datasets, respectively.
Open-Vocabulary Temporal Action Localization (OV-TAL) aims to recognize and localize instances of any desired action categories in videos without explicitly curating training data for all categories. Existing methods mostly recognize action categories at a single granularity, which degrades the recognition accuracy of both base and novel action categories. To address these issues, we propose a Multi-Grained Category-Aware Network (MGCA-Net) comprising a localizer, an action presence predictor, a conventional classifier, and a coarse-to-fine classifier. Specifically, the localizer localizes category-agnostic action proposals. For these action proposals, the action presence predictor estimates the probability that they belong to an action instance. At the same time, the conventional classifier predicts the probability of each action proposal over base action categories at the snippet granularity. Novel action categories are recognized by the coarse-to-fine classifier, which first identifies action presence at the video granularity. Finally, it assigns each action proposal to one category from the coarse categories at the proposal granularity. Through coarse-to-fine category awareness for novel actions and the conventional classifier's awareness of base actions, multi-grained category awareness is achieved, effectively enhancing localization performance. Comprehensive evaluations on the THUMOS'14 and ActivityNet-1.3 benchmarks demonstrate that our method achieves state-of-the-art performance. Furthermore, our MGCA-Net achieves state-of-the-art results under the Zero-Shot Temporal Action Localization setting.
No abstract available
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i.e., video snippets) are supervised by classifying labeled bags (i.e., untrimmed videos). However, this formulation typically treats snippets in a video as independent instances, ignoring the underlying temporal structures within and across action segments. To address this problem, we propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction. Furthermore, a multi-step refinement strategy is proposed to progressively improve action proposals along the model training process. Extensive experiments on THUMOS-14 and ActivityNet-v1.3 demonstrate the effectiveness of our approach, establishing new state of the art on both datasets. The code and models are publicly available at https://github.com/boheumd/ASM-Loc.
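For readers unfamiliar with the MIL baseline that this framework extends, the sketch below shows the standard pipeline: a snippet-level classifier produces a class activation sequence, top-k mean pooling aggregates it into video-level logits, and a binary cross-entropy loss against the video-level labels supervises training. Dimensions and the k ratio are illustrative.

```python
# Minimal sketch of MIL-style video-level supervision: snippet (instance) scores
# are aggregated by top-k mean pooling into a bag (video) prediction and trained
# against the video-level label. Not ASM-Loc's configuration, just the baseline.
import torch
import torch.nn as nn

class MILHead(nn.Module):
    def __init__(self, feat_dim=256, num_classes=20, k_ratio=0.125):
        super().__init__()
        self.classifier = nn.Conv1d(feat_dim, num_classes, kernel_size=1)
        self.k_ratio = k_ratio

    def forward(self, snippet_feats):
        # snippet_feats: (B, T, D) -> class activation sequence (B, T, C)
        cas = self.classifier(snippet_feats.transpose(1, 2)).transpose(1, 2)
        k = max(1, int(self.k_ratio * cas.shape[1]))
        video_logits = cas.topk(k, dim=1).values.mean(dim=1)   # (B, C)
        return cas, video_logits

head = MILHead()
feats = torch.randn(2, 160, 256)
video_labels = torch.zeros(2, 20); video_labels[0, 3] = 1; video_labels[1, 7] = 1
cas, logits = head(feats)
loss = nn.functional.binary_cross_entropy_with_logits(logits, video_labels)
print(cas.shape, logits.shape, float(loss))
```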
In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose an efficient Scalable-Granularity Perception (SGP) layer to mitigate the rank loss problem of self-attention that takes place in the video features and aggregate information across different temporal granularities. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks: THUMOS14, HACS and EPIC-KITCHEN 100, with lower computational costs, compared to previous methods. For example, TriDet hits an average mAP of 69.3% on THUMOS14, outperforming the previous best by 2.5%, but with only 74.6% of its latency. The code is released to https://github.com/dingfengshi/TriDet.
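The distribution-based boundary modeling can be illustrated with a small head that predicts a probability distribution over a discrete set of relative offsets at every temporal location and takes the expectation as the boundary estimate; the bin count and layout below are example choices, not TriDet's exact Trident-head.

```python
# Sketch of a boundary head that predicts a distribution over relative offsets
# and uses its expectation as the boundary, in the spirit of the
# distribution-based boundary modeling described above.
import torch
import torch.nn as nn

class DistributionBoundaryHead(nn.Module):
    def __init__(self, feat_dim=256, num_bins=16):
        super().__init__()
        self.start_logits = nn.Linear(feat_dim, num_bins)
        self.end_logits = nn.Linear(feat_dim, num_bins)
        # Candidate offsets (in snippets) to the left/right of each location.
        self.register_buffer("bins", torch.arange(num_bins).float())

    def forward(self, feats):
        # feats: (B, T, D). Expected offset = sum_b p(b) * offset_b.
        start_off = (self.start_logits(feats).softmax(-1) * self.bins).sum(-1)
        end_off = (self.end_logits(feats).softmax(-1) * self.bins).sum(-1)
        t = torch.arange(feats.shape[1], device=feats.device).float()
        return t - start_off, t + end_off                 # (B, T) start and end

head = DistributionBoundaryHead()
starts, ends = head(torch.randn(2, 100, 256))
print(starts.shape, ends.shape)     # torch.Size([2, 100]) torch.Size([2, 100])
```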
Temporal action detection (TAD) is a fundamental video understanding task that aims to identify human actions and localize their temporal boundaries in videos. Although this field has achieved remarkable progress in recent years, further progress and real-world applications are impeded by the absence of a standardized framework. Currently, different methods are compared under different implementation settings, evaluation protocols, etc., making it difficult to assess the real effectiveness of a specific technique. To address this issue, we propose OpenTAD, a unified TAD framework consolidating 16 different TAD methods and 9 standard datasets into a modular codebase. In OpenTAD, minimal effort is required to replace one module with a different design, train a feature-based TAD model in end-to-end mode, or switch between the two. OpenTAD also facilitates straightforward benchmarking across various datasets and enables fair and in-depth comparisons among different methods. With OpenTAD, we comprehensively study how innovations in different network components affect detection performance and identify the most effective design choices through extensive experiments. This study has led to a new state-of-the-art TAD method built upon existing techniques for each component. Our code and models are available at https://github.com/sming256/OpenTAD.
Temporal action detection (TAD) aims to determine the semantic label and the temporal interval of every action instance in an untrimmed video. It is a fundamental and challenging task in video understanding. Previous methods tackle this task with complicated pipelines. They often need to train multiple networks and involve hand-designed operations, such as non-maximal suppression and anchor generation, which limit the flexibility and prevent end-to-end learning. In this paper, we propose an end-to-end Transformer-based method for TAD, termed TadTR. Given a small set of learnable embeddings called action queries, TadTR adaptively extracts temporal context information from the video for each query and directly predicts action instances with the context. To adapt Transformer to TAD, we propose three improvements to enhance its locality awareness. The core is a temporal deformable attention module that selectively attends to a sparse set of key snippets in a video. A segment refinement mechanism and an actionness regression head are designed to refine the boundaries and confidence of the predicted instances, respectively. With such a simple pipeline, TadTR requires lower computation cost than previous detectors, while preserving remarkable performance. As a self-contained detector, it achieves state-of-the-art performance on THUMOS14 (56.7% mAP) and HACS Segments (32.09% mAP). Combined with an extra action classifier, it obtains 36.75% mAP on ActivityNet-1.3. Code is available at https://github.com/xlliu7/TadTR.
Video embedding is the pivot in Temporal Action Detection (TAD). Once the video embedding can robustly capture the essence of actions and perceive activities in complex scenes, the TAD model can more accurately localize action boundaries. Currently, video embedding is typically based on rule-based pixel convolution or cube-based transformer, wherein structured semantic information is intertwined, leading to the submergence of crucial spatial semantic information, such as the intrinsic motion of key semantic objects and interactions among semantic objects. To address these limitations, it is imperative to explore alternative approaches. With the remarkable performance of general semantic segmentation models in visual representation, we introduce the general segmentation model SEEM into the video embedding paradigm, constructing a semantically structured representation from perceptual semantics to cognitive semantics. To more effectively utilize SEEM for structured video representation, we designed the Semantic Adapter (Sem-Adapter) as a bridge to connect the two models. Firstly, we design a Self-Motion Module (SMM) to pay attention to the self-motion of key semantic regions. Secondly, we propose a Mutual Relation Module (MRM) to construct the interactions between semantic regions. Extensive experiments on ActivityNet-1.3, THUMOS-14 and EPIC-Kitchens-100 reveal that our method significantly outperforms state-of-the-art methods under the same input modality, and our method improves the average mAP from 60.6% to 64.2% on THUMOS-14 with the same backbone. The code is available on https://github.com/shouxiaozixuan/semtad.
No abstract available
By detecting abnormal violation events in surveillance videos, safety management capabilities in high-risk power operations can be improved. This research constructs an intelligent abnormal event detection technology using deep learning algorithms, aiming to improve the detection accuracy of anomalous events. The research improves the parameter-setting method and fully connected layer of three-dimensional convolutional networks to enhance their ability to recognize three-dimensional features. An improved algorithm is adopted as the basic structure of the temporal action detection technology, and frame interpolation is applied to improve the accuracy of temporal action detection. A surveillance-video anomaly detection model based on the improved temporal action detection technology is established. The experimental outcomes show that the improved three-dimensional convolutional network achieves convergence after 32 iterations, with an accuracy of 99.15% and a recall rate of 98.3%. The average accuracy on the three tested datasets is better than that of other algorithms. The average precision of the research model for detecting throwing objects from high altitude, crossing fences, smoking, and checking electricity without gloves is 89.1%, 88.9%, 96.6%, and 96.2%, respectively. The accuracy of abnormal event detection across different time periods is superior to other models, and the average recall of the research model is 94.3%, which is higher than that of other models. The results indicate that the research model can accurately recognize abnormal events in massive, diverse, and complex surveillance videos. The proposed abnormal event detection model can be applied to the intelligent management platform of the power industry, thereby improving safety management capability in power operations.
Existing zero-shot temporal action detection (ZSTAD) methods predominantly use fully supervised or unsupervised strategies to recognize unseen activities. However, these training-based methods are prone to domain shifts and require high computational costs, which hinder their practical applicability in real-world scenarios. In this paper, unlike previous works, we propose a training-Free Zero-shot temporal Action Detection (FreeZAD) method, leveraging existing vision-language (ViL) models to directly classify and localize unseen activities within untrimmed videos without any additional fine-tuning or adaptation. We mitigate the need for explicit temporal modeling and reliance on pseudo-label quality by designing the LOGarithmic decay weighted Outer-Inner-Contrastive Score (LogOIC) and frequency-based Actionness Calibration. Furthermore, we introduce a test-time adaptation (TTA) strategy using Prototype-Centric Sampling (PCS) to expand FreeZAD, enabling ViL models to adapt more effectively for ZSTAD. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that our training-free method outperforms state-of-the-art unsupervised methods while requiring only 1/13 of the runtime. When equipped with TTA, the enhanced method further narrows the gap with fully supervised methods.
In this paper, we examine a key limitation in query-based detectors for temporal action detection (TAD), which arises from their direct adaptation of originally designed architectures for object detection. Despite the effectiveness of the existing models, they struggle to fully address the unique challenges of TAD, such as the redundancy in multi-scale features and the limited ability to capture sufficient temporal context. To address these issues, we propose a multi-dilated gated encoder and central-adjacent region integrated decoder for temporal action detection transformer (DiGIT). Our approach replaces the existing encoder that consists of multi-scale deformable attention and feedforward network with our multi-dilated gated encoder. Our proposed encoder reduces the redundant information caused by multi-level features while maintaining the ability to capture fine-grained and long-range temporal information. Furthermore, we introduce a central-adjacent region integrated decoder that leverages a more comprehensive sampling strategy for deformable cross-attention to capture the essential information. Extensive experiments demonstrate that DiGIT achieves state-of-the-art performance on THUMOS14, ActivityNet v1.3, and HACS-Segment. Code is available at: https://github.com/Dotori-HJ/DiGIT
Temporal action detection (TAD) aims to identify and localize action instances in untrimmed videos, which is essential for various video understanding tasks. However, recent improvements in model performance, driven by larger feature extractors and datasets, have led to increased computational demands. This presents a challenge for applications like autonomous driving and robotics, which rely on limited computational resources. While existing channel pruning methods can compress these models, reducing the number of channels often hinders the parallelization efficiency of GPUs due to inefficient multiplication between small matrices. Instead of pruning channels, we propose a Progressive Block Drop method that reduces model depth while retaining layer width. In this way, we still use large matrices for computation but reduce the number of multiplications. Our approach iteratively removes redundant blocks in two steps: first, we drop blocks with minimal impact on model performance; second, we employ a parameter-efficient cross-depth alignment technique, fine-tuning the pruned model to restore accuracy. Our method achieves a 25% reduction in computational overhead on two TAD benchmarks (THUMOS14 and ActivityNet-1.3) while achieving lossless compression. More critically, we empirically show that our method is orthogonal to channel pruning methods and can be combined with them to yield further efficiency gains.
Multigranularity Feature Aggregation and Cross-level Boundary Modeling for Temporal Action Detection
This article presents a Temporal Action Detection (TAD) method with Multigranularity (MG) feature aggregation and Cross-level Boundary Modeling (CBM). Compared with other methods, our proposed approach has the following advantages. First, different from most existing works which only consider the local temporal context, a simple and computationally efficient MG module is proposed to comprehensively extract video features at instant, local, and global temporal granularities. Second, unlike the methods that only employ information from a single feature pyramid level for action boundary regression, a CBM strategy that integrates the relative information from both the same and higher level features is designed to improve the accuracy of boundary prediction. Finally, benefiting from the MG module and CBM strategy, our method outperforms other state-of-the-art approaches on five challenging TAD datasets: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100, ActivityNet-1.3, and HACS. We make our code and pre-trained model publicly available at: https://github.com/MGCBM/TAL-MGCBM
Open-vocabulary Temporal Action Detection (Open-vocab TAD) extends the detection scope of Closed-vocabulary Temporal Action Detection (Closed-vocab TAD) to unseen action classes specified by vocabularies not included in the training data, within untrimmed video. Typical Open-vocab TAD methods adopt a two-stage approach that first proposes candidate action intervals and then identifies those actions. However, errors in the first stage can affect the subsequent stage and the final detection results. Moreover, conventional methods for temporal context analyses tend to focus solely on either global or local context. Focusing solely on the global context can lead to lack of momentary detail, making it difficult to distinguish one action from another. Conversely, focusing only on the local context makes it challenging to determine the start and end timings of action intervals. To address these challenges, we introduce a one-stage approach named Hierarchical Open-vocab TAD (HOTAD), consisting of two branches: Temporal Context Analysis (TCA) and Video–Text Alignment (VTA). The former utilizes Hierarchical Encoder (HE) to fuse global and local temporal features, enabling a comprehensive capture of temporal actions, while the latter branch exploits the synergy between visual and textual modalities for precisely detecting unseen actions in the Open-vocab setting. Experiments and in-depth analysis using the widely recognized datasets THUMOS14 and ActivityNet-1.3 are performed to show the effectiveness of HOTAD. The results highlight remarkable accuracy in detecting a wide range of unseen actions. Furthermore, HOTAD significantly reduces wrong labels and localizes action instances with high precision, showcasing its robustness in complex and dynamic video settings.
Sports videos contain a large number of irrelevant backgrounds and static frames, which affect the efficiency and accuracy of temporal action detection. To optimize sports video data processing and temporal action detection, an improved multi-level spatiotemporal transformer network model is proposed. The model first optimizes the initial feature extraction of videos through an unsupervised video data preprocessing model based on deep residual networks. Subsequently, multi-scale features are generated through feature pyramid networks. The global spatiotemporal dependencies of actions are captured by a spatiotemporal encoder, and a frame-level self-attention module further extracts keyframes and highlights temporal features, thereby improving detection accuracy. The accuracy of the proposed model was 0.6 at the beginning of training, reached 0.85 after 300 iterations, and peaked at nearly 0.9 after 500 iterations. The mAP of the improved model on the dataset reached 90.5%, higher than the 78.2% of the base model; the recall rate was 92.0%, the precision was 89.5%, and the computation time was 220 ms. Meanwhile, the model shows balanced performance in detecting different types of sports, especially in recognizing complex movements such as gymnastics and diving. The model effectively improves the efficiency and accuracy of temporal action detection through the collaborative action of multiple modules, demonstrating good applicability and robustness.
Temporal Action Detection (TAD) aims to identify and localize actions by determining their starting and ending frames within untrimmed videos. Recent Structured State-Space Models such as Mamba have demonstrated potential in TAD due to their long-range modeling capability and linear computational complexity. On the other hand, structured state-space models often face two key challenges in TAD, namely, decay of temporal context due to recursive processing and self-element conflict during global visual context modeling, which become more severe while handling long-span action instances. Additionally, traditional methods for TAD struggle with detecting long-span action instances due to a lack of global awareness and inefficient detection heads. This paper presents MambaTAD, a new state-space TAD model that introduces long-range modeling and global feature detection capabilities for accurate temporal action detection. MambaTAD comprises two novel designs that complement each other with superior TAD performance. First, it introduces a Diagonal-Masked Bidirectional State-Space (DMBSS) module which effectively facilitates global feature fusion and temporal action detection. Second, it introduces a global feature fusion head that refines the detection progressively with multi-granularity features and global awareness. In addition, MambaTAD tackles TAD in an end-to-end one-stage manner using a new state-space temporal adapter (SSTA) which reduces network parameters and computation cost with linear complexity. Extensive experiments show that MambaTAD achieves superior TAD performance consistently across multiple public benchmarks.
Temporal action detection aims to locate and classify actions in untrimmed videos. While recent works focus on designing powerful feature processors for pre-trained representations, they often overlook the inherent noise and redundancy within these features. Large-scale pre-trained video encoders tend to introduce background clutter and irrelevant semantics, leading to context confusion and imprecise boundaries. To address this, we propose a frequency-aware decoupling network that improves action discriminability by filtering out noisy semantics captured by pre-trained models. Specifically, we introduce an adaptive temporal decoupling scheme that suppresses irrelevant information while preserving fine-grained atomic action details, yielding more task-specific representations. In addition, we enhance inter-frame modeling by capturing temporal variations to better distinguish actions from background redundancy. Furthermore, we present a long-short-term category-aware relation network that jointly models local transitions and long-range dependencies, improving localization precision. The refined atomic features and frequency-guided dynamics are fed into a standard detection head to produce accurate action predictions. Extensive experiments on THUMOS14, HACS, and ActivityNet-1.3 show that our method, powered by InternVideo2-6B features, achieves state-of-the-art performance on temporal action detection benchmarks.
No abstract available
Recently proposed neural network-based Temporal Action Detection (TAD) models are inherently limited in extracting discriminative representations and modeling action instances of various lengths from complex scenes with shared-weight detection heads. Inspired by the successes of dynamic neural networks, in this paper we build a novel dynamic feature aggregation (DFA) module that can simultaneously adapt kernel weights and receptive fields at different timestamps. Based on DFA, the proposed dynamic encoder layer aggregates the temporal features within the action time ranges and guarantees the discriminability of the extracted representations. Moreover, using DFA helps to develop a Dynamic TAD head (DyHead), which adaptively aggregates the multi-scale features with adjusted parameters and learned receptive fields to better detect action instances with diverse ranges in videos. With the proposed encoder layer and DyHead, a new dynamic TAD model, DyFADet, achieves promising performance on a series of challenging TAD benchmarks, including HACS-Segment, THUMOS14, ActivityNet-1.3, Epic-Kitchen 100, Ego4D-Moment QueriesV1.0, and FineAction. Code is released to https://github.com/yangle15/DyFADet-pytorch.
Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection, several methods have adapted the query-based framework to the TAD task. However, these approaches primarily followed DETR to predict actions at the instance level (i.e., identify each action by its center point), leading to sub-optimal boundary localization. To address this issue, we propose a new Dual-level query-based TAD framework, namely DualDETR, to detect actions from both instance-level and boundary-level. Decoding at different levels requires semantics of different granularity, therefore we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels, facilitating explicit capture of temporal cues and semantics at each level. On top of the two-branch design, we present a joint query initialization strategy to align queries from both levels. Specifically, we leverage encoder proposals to match queries from each level in a one-to-one manner. Then, the matched queries are initialized using position and content prior from the matched action proposal. The aligned dual-level queries can refine the matched proposal with complementary cues during subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD benchmarks. The experimental results demonstrate the superior performance of DualDETR to the existing state-of-the-art methods, achieving a substantial improvement under det-mAP and delivering impressive results under seg-mAP.
Accurate proposal generation is crucial for subsequent classification networks; thus, temporal action proposal generation (TAPG) methods have a significant influence in the field of Temporal Action Detection. The preparation process of supervised TAPG methods is time-consuming and resource-intensive, relying on a large amount of labeled data. Furthermore, due to the relatively small variations in feature sequences at the temporal level in videos, localizing the boundaries of actions is particularly challenging. To address these issues, we first propose a self-supervised pre-training method that designs a Random Query Segment Detection pretext task as the learning objective for pre-training. This enables the training of an action localizer without any annotations. Additionally, when localizing video action segments, the temporal boundaries can be blurred, and the simple feature contrast operation designed during the pre-training process may not effectively distinguish action boundaries. Therefore, this work introduces an improved method, a self-supervised pre-training transformer based on triplets (SSPT-Tr), for triplet-based feature reconstruction to address the aforementioned issue. A negative video segment is added to reconstruct features, and a triplet loss is used to further constrain the boundary feature expression capabilities between action and background. This can effectively enhance the feature discrimination between actions and non-actions. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate that the SSPT-Tr method clearly improves performance, not only improving the AR but also shortening the training time of the downstream task. SSPT-Tr combined with UNet also outperforms other methods in the field of Temporal Action Detection in terms of mAP at various tIoU thresholds.
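The triplet constraint mentioned above can be written compactly: an anchor action-segment feature is pulled toward a positive action feature and pushed away from a negative background feature by a margin. The margin, feature sizes, and sampling are illustrative assumptions.

```python
# Sketch of a triplet constraint between action and background segment features:
# anchor and positive are action segments, negative is a background segment.
import torch
import torch.nn.functional as F

def boundary_triplet_loss(anchor, positive, negative, margin=0.5):
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return F.relu(d_pos - d_neg + margin).mean()

anchor = torch.randn(8, 256)                      # query action segment features
positive = anchor + 0.1 * torch.randn(8, 256)     # other action segment features
negative = torch.randn(8, 256)                    # background segment features
print(float(boundary_triplet_loss(anchor, positive, negative)))
```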
Temporal action detection (TAD) is a vital challenge in computer vision and the Internet of Things, aiming to detect and identify actions within temporal sequences. While TAD has primarily been associated with video data, its applications can also be extended to sensor data, opening up opportunities for various real-world applications. However, applying existing TAD models to sensory signals presents distinct challenges such as varying sampling rates, intricate pattern structures, and subtle, noise-prone patterns. In response to these challenges, we propose a Sensory Temporal Action Detection (STADe) model. STADe leverages Fourier kernels and adaptive frequency filtering to adaptively capture the nuanced interplay of temporal and frequency features underlying complex patterns. Moreover, STADe embraces adaptability by employing deep fusion at varying resolutions and scales, making it versatile enough to accommodate diverse data characteristics, such as the wide spectrum of sampling rates and action durations encountered in sensory signals. Unlike conventional models with unidirectional category-to-proposal dependencies, STADe adopts a cross-cascade predictor to introduce bidirectional and temporal dependencies within categories. To extensively evaluate STADe and promote future research in sensory TAD, we establish three diverse datasets using various sensors, featuring diverse sensor types, action categories, and sampling rates. Experiments across one public and our three new datasets demonstrate STADe’s superior performance over state-of-the-art TAD models in sensory TAD tasks.
Temporal action detection aims to predict temporal boundaries and category labels of actions in untrimmed videos. In the past years, many weakly supervised temporal action detection methods have been proposed to relieve the annotation cost of fully supervised methods. Due to the discrepancy between action localization and action classification, the two-branch structure is widely adopted by existing weakly supervised methods, where the classification branch is used to predict category-wise score and the localization branch is used to predict foreground score for each segment. Under the weakly supervised setting, the model training is mainly guided by the video-level or sparse segment-level annotations. As a result, the classification branch tends to focus on the most discriminative segments while ignoring less discriminative ones so as to minimize the classification cost, and the localization branch may assign high foreground scores to some negative segments. This phenomenon can severely damage the action detection performance, because the foreground scores and classification scores are combined together in the testing stage for action detection. To deal with this problem, several methods have been proposed to encourage the consistency between the classification branch and localization branch. However, these methods only consider the video-level or segment-level consistency, without considering the relation among different segments to be consistent. In this paper, we propose a Cross-Task Relation-Aware Consistency (CRC) strategy for weakly supervised temporal action detection, including an intra-video consistency module and an inter-video consistency module. The intra-video consistency module can well guarantee the relationship among segments from the same video to be consistent, and the inter-video consistency module guarantees the relationship among segments from different videos to be consistent. These two modules are complementary to each other by combining both intra-video and inter-video consistency. Experimental results show that the proposed CRC strategy can consistently improve the performance of existing weakly supervised methods, including click-level supervised methods (e.g., LACP Lee et al., 2021), video-level supervised methods (e.g., DELU Chen et al., 2022) and unsupervised methods (e.g., BaS-Net Lee et al., 2020), verifying the generality and effectiveness of the proposed method.
Temporal action detection (TAD) is a critical task in video understanding. Nevertheless, most existing closed-set TAD methods often struggle to replicate their high performance when completely unseen or unknown actions emerge in an open-world test environment. To this end, the open-set temporal action detection (OSTAD) task has been recently proposed to relax the closed-set TAD condition to the unknown-aware open-set detection. Given only a limited number of known action classes available in model training, precisely localizing and rejecting the unknown action instances is extremely difficult and requires strong model generalization abilities. However, existing approaches are yet far from optimal in discriminative action feature learning and prediction uncertainty estimation, which may hamper model generalization to unknown action detection. To address these issues, this paper proposes a novel Causal and Evidential Open-set Temporal Action Detection model named CEO-TAD for improved OSTAD performance. It accomplishes expressive video feature pyramid extraction, discriminative causal action feature representation learning, and reliable EDL-based prediction uncertainty estimation with our tailored network architectures and modified loss functions. Experimental results show that our proposed method achieves state-of-the-art open-set temporal action detection performance on the THUMOS14 and ActivityNet1.3 benchmarks. Ablation studies verify the effectiveness of the proposed model components.
No abstract available
Temporal action detection is a key task in video understanding, with one major challenge being the handling of confounders. Confounders include both observed factors (e.g., temporal order, co-occurrence patterns of actions) and unobserved factors (e.g., lighting, individual states), which can introduce bias and affect predictions. While causal inference methods have been introduced, they often rely on fixed representations of confounders, limiting their adaptability to dynamic contexts, particularly with unobserved confounders. To address this, we propose TAD-IVR, which combines Transformer with instrumental variable (IV) regression. Transformer flexibly captures the temporal dependencies of actions, improving the representation of confounders, while IV regression uses exogenous variables to eliminate the influence of unobserved confounders, thus reducing prediction bias. Additionally, we introduce mutual information constraints and zero-sum optimization strategies to enforce more informative and accurate feature representations. Experimental results show that TAD-IVR effectively mitigates confounding effects and improves detection accuracy.
Temporal Action Detection (TAD) is a fundamental task in video understanding that aims to identify and localize action instances within videos. Although recent methods have achieved remarkable progress, they are built upon various combinations of temporal backbones and pre-trained features, making it difficult to assess the true effectiveness of each component. To address this, we conduct a systematic study of these combinations. Our analysis reveals that Transformer-based pre-trained features already provide sufficient global context, rendering additional global modeling in the backbone redundant. Instead, performance significantly improves when these global features are complemented by dedicated local temporal modeling. Motivated by this insight, we propose Local Temporal Mamba (LTMamba), which preserves the rich global context from pre-trained features while integrating Local Mamba blocks into the temporal backbone. These blocks excel at efficiently modeling complex local dependencies within variable temporal windows, enabling the model to effectively exploit both global and local information. To validate the effectiveness of this design, we demonstrate that LTMamba outperforms state-of-the-art methods that rely on global modeling in both the pre-trained features and the temporal backbone, achieving 73.7% mAP on THUMOS14 (+1.0) and 42.4% mAP on ActivityNet (+0.4).
In this paper, we find that normalized coordinate expression is a key factor behind the reliance on hand-crafted components in query-based detectors for temporal action detection (TAD). Despite significant advancements towards an end-to-end framework in object detection, query-based detectors have been limited in achieving full end-to-end modeling in TAD. To address this issue, we propose TE-TAD, a fully end-to-end temporal action detection transformer that integrates time-aligned coordinate expression. We reformulate coordinate expression utilizing actual timeline values, ensuring length-invariant representations across the extremely diverse range of video durations. Furthermore, our proposed adaptive query selection dynamically adjusts the number of queries based on video length, providing a suitable solution for varying video durations compared to a fixed query set. Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors. Our TE-TAD outperforms previous query-based detectors and achieves competitive performance compared to state-of-the-art methods on popular benchmark datasets. Code is available at: https://github.com/Dotori-HJ/TE-TAD
No abstract available
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Despite that they focus on different events, we observe they have a significant connection. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we aim to investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes two novel query-dependent decoders to generate a uniform output of classification score and temporal segments. Secondly, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, in order to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform the separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets Ego4D, Charades-STA, and ActivityNet. Our code is available at https://github.com/yingsen1/UniMD.
Recently, temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However, due to the memory bottleneck, only models with limited scales and limited data volumes can afford end-to-end training, which inevitably restricts TAD performance. In this paper, we reduce the memory consumption for end-to-end training, and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1,536 frames, leading to significant detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA), which is a novel lightweight module that reduces training memory. Using TIA, we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design, we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14, being the first end-to-end model to outperform the best feature-based methods. Code is available at https://github.com/sming256/AdaTAD.
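A minimal sketch of the adapter idea, assuming a frozen backbone whose intermediate tokens have shape (B, T, C); the module name, bottleneck width, and kernel size are illustrative, not the AdaTAD implementation:

```python
import torch
import torch.nn as nn

class TemporalAdapter(nn.Module):
    """Lightweight adapter sketch: a bottleneck with a depthwise temporal
    convolution, added residually so only the adapter needs gradients while
    the wrapped backbone block stays frozen."""
    def __init__(self, dim, bottleneck=64, kernel=3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel,
                                  padding=kernel // 2, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)   # start as an identity mapping (residual = 0)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):                # x: (B, T, C) frame tokens from the backbone
        h = self.down(x)
        h = self.temporal(h.transpose(1, 2)).transpose(1, 2)  # aggregate adjacent frames
        return x + self.up(torch.relu(h))
```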
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries, similar to DETR, which has shown great success in object detection. However, the framework suffers from several problems if directly applied to TAD: the insufficient exploration of inter-query relation in the decoder, the inadequate classification training due to a limited number of training samples, and the unreliable classification scores at inference. To this end, we first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations. Moreover, we propose two losses to facilitate and stabilize the training of action classification. Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries. The proposed method, named ReAct, achieves the state-of-the-art performance on THUMOS14, with much lower computational costs than previous methods. Besides, extensive ablation studies are conducted to verify the effectiveness of each proposed component. The code is available at https://github.com/sssste/React.
Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results, their robustness has not been thoroughly studied. In practice, we observe that temporal information in videos can be occasionally corrupted, such as missing or blurred frames. Interestingly, existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness, we establish two temporal corruption robustness benchmarks, namely THUMOS14-C and ActivityNet-v1.3-C. In this paper, we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions, and end-to-end methods are often more susceptible than those with a pretrained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance, TAD models tend to yield the largest performance drop. Besides building a benchmark, we further develop a simple but effective robust training method to defend against temporal corruptions, through the Frame-Drop augmentation and Temporal-Robust Consistency loss. Remarkably, our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark.
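The Frame-Drop augmentation and the consistency objective could look roughly like the sketch below; the function names and the MSE-based consistency term are assumptions, and the paper's exact loss may differ:

```python
import torch
import torch.nn.functional as F

def frame_drop(features, drop_prob=0.1):
    """Randomly zero out a few frames to simulate temporal corruption.
    features: (B, T, C) snippet features."""
    keep = (torch.rand(features.shape[:2], device=features.device) > drop_prob).float()
    return features * keep.unsqueeze(-1)

def temporal_robust_consistency(model, features):
    """Encourage predictions on clean and frame-dropped inputs to agree."""
    with torch.no_grad():
        clean = model(features)                 # (B, T, num_classes) predictions, assumed
    corrupted = model(frame_drop(features))
    return F.mse_loss(corrupted, clean)
```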
Existing temporal action detection (TAD) methods rely on large training data including segment-level annotations, limited to recognizing previously seen classes alone during inference. Collecting and annotating a large training set for each class of interest is costly and hence unscalable. Zero-shot TAD (ZS-TAD) resolves this obstacle by enabling a pre-trained model to recognize any unseen action classes. Meanwhile, ZS-TAD is also much more challenging with significantly less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive method is to integrate an off-the-shelf proposal detector with CLIP-style classification. However, due to the sequential localization (e.g., proposal generation) and classification design, it is prone to localization error propagation. To overcome this problem, in this paper we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). Such a novel design effectively eliminates the dependence between localization and classification by breaking the route for error propagation in-between. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that our STALE significantly outperforms state-of-the-art alternatives. Besides, our model also yields superior results on supervised TAD over recent strong competitors. The PyTorch implementation of STALE is available at https://github.com/sauradip/STALE.
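The CLIP-style snippet classification underlying such zero-shot pipelines can be sketched as cosine similarity between snippet embeddings and class-name text embeddings; the pre-extracted features, shared embedding space, and temperature value below are assumptions:

```python
import torch
import torch.nn.functional as F

def zero_shot_snippet_scores(snippet_feats, class_text_feats, tau=0.01):
    """CLIP-style zero-shot snippet classification sketch: cosine similarity
    between visual snippet embeddings and class-name text embeddings,
    both assumed to live in a shared embedding space."""
    v = F.normalize(snippet_feats, dim=-1)      # (T, D) visual snippet embeddings
    t = F.normalize(class_text_feats, dim=-1)   # (K, D) one embedding per class prompt
    return F.softmax(v @ t.t() / tau, dim=-1)   # (T, K) per-snippet class probabilities
```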
Zero-shot temporal action detection (ZS-TAD), aiming to recognize and detect new and unseen video actions, is an emerging and challenging task with limited solutions. Recent studies have adapted the vision-language pre-trained model CLIP for this task in a parameter-efficient fine-tuning fashion to achieve open-vocabulary detection. However, they suffer from insufficient vision-text alignment because of the dual-stream structure of CLIP and yield inferior TAD results due to the lack of accurate action prior. In this paper, we target the above limitations and propose to learn multimodal Prompts and Text-Enhanced Actionness (mProTEA) for ZS-TAD. Specifically, we insert learnable layer-wise prompts into the vision and text branches of the frozen CLIP and establish a strong coupling between them, resulting in multimodal prompts that can boost cross-modal alignment. To ease computation costs, we propose to conduct multimodal prompt learning on an image recognition dataset with rich concepts (e.g., ImageNet) first and then keep them frozen during TAD fine-tuning. For improving TAD, we introduce text-enhanced actionness modeling, where we leverage the concise semantics of text to assist the calculation of class-agnostic actionness scores, to offer accurate prior information for both action classification and localization. With the above designs, our mProTEA excels in extensive TAD experiments, surpassing the strong competitor STALE by 5.1% on ActivityNet under the zero-shot setting and achieving state-of-the-art performance in conventional supervised scenarios. Ablation studies confirm the effectiveness of our proposals and show superior domain generalization of multimodal prompts learned on ImageNet against the other 10 image recognition datasets.
Action recognition and localization in complex, untrimmed videos remain a formidable challenge in computer vision, largely due to the limitations of existing methods in capturing fine-grained actions, long-term temporal dependencies, and high-level semantic information from low-level visual features. This paper introduces the Event-Contextualized Video Transformer (ECVT), a novel architecture that leverages the advanced semantic understanding capabilities of Large Vision-Language Models (LVLMs) to bridge this gap. ECVT employs a dual-branch design, comprising a Video Encoding Branch for spatio-temporal feature extraction and a Cross-Modal Guidance Branch. The latter utilizes an LVLM to generate multi-granularity semantic descriptions, including Global Event Prompting for macro-level narrative and Temporal Sub-event Prompting for fine-grained action details. These multi-level textual cues are integrated into the video encoder's learning process through sophisticated mechanisms such as adaptive gating for high-level semantic fusion, cross-modal attention for fine-grained feature refinement, and an event graph module for temporal context calibration. Trained end-to-end with a comprehensive loss function incorporating semantic consistency and temporal calibration terms, ECVT significantly enhances the model's ability to understand video temporal structures and event logic. Extensive experiments on ActivityNet v1.3 and THUMOS14 datasets demonstrate that ECVT achieves state-of-the-art performance, with an average mAP of 40.5% on ActivityNet v1.3 and mAP@0.5 of 67.1% on THUMOS14, outperforming leading baselines.
Weakly-Supervised Temporal Action Localization is a very challenging task of classifying and localizing all actions in an untrimmed video, because frame-wise labels are not given during training and the only supervision is the video-level action class. Due to the complexity of video structure, previous methods do not take advantage of the context information among long-term action-related frames. In this paper, we propose a Global Context Relation Network which introduces the self-attention mechanism. The first part uses the context relation module to encode features according to the relationships of the global context and merge them into the original features, which allows the network to better capture long-term dependencies in the video. Then local feature encoding is performed by convolution to obtain a more accurate class activation sequence. Extensive experiments on two benchmark datasets, THUMOS14 and ActivityNet1.3, demonstrate that our method outperforms existing state-of-the-art results on THUMOS14 and achieves very comparable performance on ActivityNet1.3.
No abstract available
We present an efficient approach for temporal action co-localization (TACL), which means to simultaneously localize all action instances in an untrimmed video. Compared with the conventional instance-by-instance action localization, TACL can exploit the contextual and temporal relationships among action instances to reduce the localization ambiguities. Motivated by the strong relational modeling capability of graph neural networks, we propose a Graph-based Temporal Action Co-Localization (G-TACL) method. By considering each action proposal as a node, G-TACL effectively aggregates contextual and temporal features from related action proposals to jointly recognize and localize all action instances in a single shot. Moreover, we introduce a novel multi-level consistency evaluator to measure the relatedness between any two action proposals. This is achieved by considering their high-level contextual similarities, low-level temporal coincidences and feature correlations. We exploit the Gated Recurrent Units (GRUs) to iteratively update the features of each node, which are then used to regress the temporal boundaries of action proposals and finally achieve action co-localization. Experimental results on three datasets, i.e., THUMOS14, MEXaction2 and ActivityNet v1.3, demonstrate that our G-TACL is superior or comparable to the state-of-the-art.
Weakly-supervised temporal action localization aims to identify action instances using only video-level labels and localize their positions in untrimmed videos. Due to the temporal continuity of video data, most methods that use a single-scale convolution kernel cannot effectively model the characteristics of video data, leading to a decrease in accuracy. However, simply using multi-scale features can introduce redundant information and noise, reducing model efficiency while also affecting the accurate judgement of the model during the training process. To alleviate this problem, a video complicated-information extraction and filtering network (VCEF-Net) is proposed. It contains two main modules. The first, a multi-scale feature extraction module, is developed to enrich the information the model receives. The second, a pseudo-label filtering module, inhibits the interference of redundant information. VCEF-Net introduces these two modules to better utilize video information. Experiments on THUMOS14 and ActivityNet1.2 demonstrate the better performance of the proposed VCEF-Net and validate its effectiveness.
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels. Despite recent advances, existing approaches mainly follow a localization-by-classification pipeline, generally processing each segment individually, thereby exploiting only limited contextual information. As a result, the model will lack a comprehensive understanding (e.g. appearance and temporal structure) of various action patterns, leading to ambiguity in classification learning and temporal localization. Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset to recover the dataset-level semantic structure of action instances via weak labels only, thereby indirectly improving the holistic understanding of fine-grained action patterns and alleviating the aforementioned ambiguities. Specifically, an end-to-end framework is proposed, including a Robust Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge Summarization and Aggregation (GKSA) module. First, the RMGCL module explores the contrast and consistency of cross-video action features, assisting in learning more structured and compact embedding space, thus reducing ambiguity in classification learning. Further, the GKSA module is used to efficiently summarize and propagate the cross-video representative action knowledge in a learnable manner to promote holistic action patterns understanding, which in turn allows the generation of high-confidence pseudo-labels for self-learning, thus alleviating ambiguity in temporal localization. Extensive experiments on THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
Temporal action localization (TAL) in untrimmed videos recently emerged as a crucial research topic, which has been applied in various applications such as surveillance, crowd monitoring, and driver distraction recognition. Most modern approaches in TAL divide this problem into two parts: i) feature extraction for action recognition; and ii) temporal boundary for action localization. In this study, we focus on improving the performance of the TAL task by exploiting the feature extraction effectively. Specifically, we present a temporal triplet algorithm in order to enhance temporal density-dependence information for the input video clips. Moreover, the multiview fusion framework is taken into account for enriching action representation. For the evaluation, we conduct the proposed method on the 2023 AI City Challenge Dataset. Accordingly, our method achieves competitive results and belongs to the top public leaderboard in Track 3 of the Challenge.
Weakly-supervised temporal action localization aims to localize actions in untrimmed videos with only video-level labels. Most existing methods address this problem with a “localization-by-classification” pipeline that localizes action regions based on snippet-wise classification sequences. Snippet-wise classifications are unfortunately error prone due to the sparsity of video-level labels. Inspired by recent success in unsupervised contrastive representation learning, we propose a novel denoised cross-video contrastive algorithm, aiming to enhance the feature discrimination ability of video snippets for accurate temporal action localization in the weakly-supervised setting. This is enabled by three key designs: 1) an effective pseudo-label denoising module to alleviate the side effects caused by noisy contrastive features, 2) an efficient region-level feature contrast strategy with a region-level memory bank to capture “global” contrast across the entire dataset, and 3) a diverse contrastive learning strategy to enable action-background separation as well as intra-class compactness & inter-class separability. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate the superior performance of our approach.
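A rough sketch of a class-wise, region-level memory bank used for cross-video contrast; the class-indexed queue structure, queue size, and multi-positive InfoNCE form are assumptions rather than the paper's code:

```python
import torch
import torch.nn.functional as F

class RegionMemoryBank:
    """Class-wise region-level memory bank sketch for cross-video contrast."""
    def __init__(self, num_classes, dim, size=128):
        self.bank = F.normalize(torch.randn(num_classes, size, dim), dim=-1)
        self.ptr = torch.zeros(num_classes, dtype=torch.long)
        self.size = size

    def enqueue(self, cls_id, feat):
        i = int(self.ptr[cls_id])
        self.bank[cls_id, i] = F.normalize(feat.detach(), dim=-1)
        self.ptr[cls_id] = (i + 1) % self.size

    def contrast(self, feats, cls_id, tau=0.07):
        """Multi-positive InfoNCE: same-class bank entries are positives,
        all other-class entries are negatives. feats: (N, dim)."""
        feats = F.normalize(feats, dim=-1)
        pos = self.bank[cls_id]                                    # (size, dim)
        mask = torch.arange(self.bank.size(0)) != cls_id
        neg = self.bank[mask].reshape(-1, feats.size(-1))          # ((C-1)*size, dim)
        logits = torch.cat([feats @ pos.t(), feats @ neg.t()], dim=-1) / tau
        targets = torch.zeros_like(logits)
        targets[:, :pos.size(0)] = 1.0 / pos.size(0)
        return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```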
Weakly-supervised temporal action localization aims to identify and localize the action instances in the untrimmed videos with only video-level action labels. When humans watch videos, we can adapt our abstract-level knowledge about actions in different video scenarios and detect whether some actions are occurring. In this paper, we mimic how humans do and bring a new perspective for locating and identifying multiple actions in a video. We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video. The learned queries not only contain the actions' knowledge features at the abstract level but also have the ability to fit this knowledge into the target video scenario, and they will be used to detect the presence of the corresponding action along the temporal dimension. To better learn these action category queries, we exploit not only the features of the current input video but also the correlation between different videos through a novel video-specific action category query learner worked with a query similarity loss. Finally, we conduct extensive experiments on three commonly used datasets (THUMOS14, ActivityNet1.2, and ActivityNet1.3) and achieve state-of-the-art performance.
Temporal action localization (TAL) is an important and challenging problem in video understanding. However, most existing TAL benchmarks are built upon the coarse granularity of action classes, which exhibits two major limitations in this task. First, coarse-level actions can make the localization models overfit in high-level context information, and ignore the atomic action details in the video. Second, the coarse action classes often lead to the ambiguous annotations of temporal boundaries, which are inappropriate for temporal action localization. To tackle these problems, we develop a novel large-scale and fine-grained video dataset, coined as FineAction, for temporal action localization. In total, FineAction contains 103K temporal instances of 106 action categories, annotated in 17K untrimmed videos. Compared to the existing TAL datasets, our FineAction takes distinct characteristics of fine action classes with rich diversity, dense annotations of multiple instances, and co-occurring actions of different classes, which introduces new opportunities and challenges for temporal action localization. To benchmark FineAction, we systematically investigate the performance of several popular temporal localization methods on it, and deeply analyze the influence of fine-grained instances in temporal action localization. As a minor contribution, we present a simple baseline approach for handling the fine-grained action detection, which achieves an mAP of 13.17% on our FineAction. We believe that FineAction can advance research of temporal action localization and beyond. The dataset is available at https://deeperaction.github.io/datasets/fineaction.
With video-level labels, weakly supervised temporal action localization (WTAL) applies a localization-by-classification paradigm to detect and classify the action in untrimmed videos. Due to the characteristic of classification, class-specific background snippets are inevitably mis-activated to improve the discriminability of the classifier in WTAL. To alleviate the disturbance of background, existing methods try to enlarge the discrepancy between action and background through modeling background snippets with pseudo-snippet-level annotations, which largely rely on hand-crafted assumptions. Distinct from previous works, we present an adversarial learning strategy to break the limitation of mining pseudo background snippets. Concretely, the background classification loss forces the whole video to be regarded as background by a background gradient reinforcement strategy, confusing the recognition model. Conversely, the foreground (action) loss guides the model to focus on action snippets under such conditions. As a result, competition between the two classification losses drives the model to boost its ability for action modeling. Simultaneously, a novel temporal enhancement network is designed to facilitate the model to construct temporal relations among affinity snippets based on the proposed strategy, further improving the performance of action localization. Finally, extensive experiments conducted on THUMOS14 and ActivityNet1.2 demonstrate the effectiveness of the proposed method.
We propose a novel method of exploiting informative video segments by learning segment weights for temporal action localization in untrimmed videos. Informative video segments represent the intrinsic motion and appearance of an action, and thus contribute crucially to action localization. The learned segment weights represent the informativeness of video segments to recognize actions and help infer the boundaries required to temporally localize actions. We build a supervised temporal attention network (STAN) that includes a supervised segment-level attention module to dynamically learn the weights of video segments, and a feature-level attention module to effectively fuse multiple features of segments. Through the cascade of the attention modules, STAN exploits informative video segments and generates descriptive and discriminative video representations. We use a proposal generator and a classifier to estimate the boundaries of actions and classify the classes of actions. Extensive experiments are conducted on two public benchmarks, i.e., THUMOS2014 and ActivityNet1.3. The results demonstrate that our proposed method achieves competitive performance compared with existing state-of-the-art methods. Moreover, compared with the baseline method that treats video segments equally, STAN achieves significant improvements with an increase of the mean average precision from 30.4% to 39.8% on the THUMOS2014 dataset, and from 31.4% to 35.9% on the ActivityNet1.3 dataset, demonstrating the effectiveness of learning informative video segments for temporal action localization.
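The segment-level attention weighting can be illustrated with a small pooling module; the names, shapes, and softmax normalization are assumed for the sketch and are not the STAN implementation:

```python
import torch
import torch.nn as nn

class SegmentAttentionPool(nn.Module):
    """Segment-level attention sketch: learn an informativeness weight per
    segment and pool segment features with those weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, segments):                               # segments: (B, N, C)
        weights = torch.softmax(self.score(segments), dim=1)   # (B, N, 1) informativeness
        video_repr = (weights * segments).sum(dim=1)           # (B, C) weighted representation
        return video_repr, weights.squeeze(-1)
```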
Temporal Action Localization (TAL) remains a fundamental challenge in video understanding, aiming to identify the start time, end time, and category of all action instances within untrimmed videos. While recent single-stage, anchor-free models like ActionFormer have set a high standard by leveraging Transformers for temporal reasoning, they often struggle with two persistent issues: the precise localization of actions with ambiguous or "fuzzy" temporal boundaries and the effective fusion of multi-scale contextual information. In this paper, we introduce the Temporal Boundary Transformer (TBT-Former), a new architecture that directly addresses these limitations. TBT-Former enhances the strong ActionFormer baseline with three core contributions: (1) a higher-capacity scaled Transformer backbone with an increased number of attention heads and an expanded Multi-Layer Perceptron (MLP) dimension for more powerful temporal feature extraction; (2) a cross-scale feature pyramid network (FPN) that integrates a top-down pathway with lateral connections, enabling richer fusion of high-level semantics and low-level temporal details; and (3) a novel boundary distribution regression head. Inspired by the principles of Generalized Focal Loss (GFL), this new head recasts the challenging task of boundary regression as a more flexible probability distribution learning problem, allowing the model to explicitly represent and reason about boundary uncertainty. Within the paradigm of Transformer-based architectures, TBT-Former advances the formidable benchmark set by its predecessors, establishing a new level of performance on the highly competitive THUMOS14 and EPIC-Kitchens 100 datasets, while remaining competitive on the large-scale ActivityNet-1.3. Our code is available at https://github.com/aaivu/In21-S7-CS4681-AML-Research-Projects/tree/main/projects/210536K-Multi-Modal-Learning_Video-Understanding
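A GFL-style boundary distribution head can be sketched as predicting a discrete distribution over start/end offsets and taking its expectation; the bin count and offset range below are assumed values, not TBT-Former's configuration:

```python
import torch
import torch.nn as nn

class BoundaryDistributionHead(nn.Module):
    """GFL-style boundary head sketch: predict a discrete distribution over
    start/end offsets and take its expectation as the regressed boundary."""
    def __init__(self, dim, num_bins=16, max_offset=64.0):
        super().__init__()
        self.logits = nn.Linear(dim, 2 * num_bins)                 # start and end distributions
        self.register_buffer("bins", torch.linspace(0, max_offset, num_bins))
        self.num_bins = num_bins

    def forward(self, feat):                                       # feat: (B, T, C)
        p = self.logits(feat).view(*feat.shape[:-1], 2, self.num_bins).softmax(-1)
        offsets = (p * self.bins).sum(-1)                          # (B, T, 2) expected offsets
        return offsets, p                                          # p exposes boundary uncertainty
```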
Temporal action localization (TAL) has drawn much attention in recent years, however, the performance of previous methods is still far from satisfactory due to the lack of annotated untrimmed video data. To deal with this issue, we propose to improve the utilization of current data through feature augmentation. Given an input video, we first extract video features with pre-trained video encoders, and then randomly mask various semantic contents of video features to consider different views of video features. To avoid damaging important action-related semantic information, we further develop a learnable feature augmentation framework to generate better views of videos. In particular, a Mask-based Feature Augmentation Module (MFAM) is proposed. The MFAM has three advantages: 1) it captures the temporal and semantic relationships of original video features, 2) it generates masked features with indispensable action-related information, and 3) it randomly recycles some masked information to ensure diversity. Finally, we input the masked features and the original features into shared action detectors respectively, and perform action classification and localization jointly for model learning. The proposed framework can improve the robustness and generalization of action detectors by learning more and better views of videos. In the testing stage, the MFAM can be removed, which does not bring extra computational costs. Extensive experiments are conducted on four TAL benchmark datasets. Our proposed framework significantly improves different TAL models and achieves the state-of-the-art performances.
Naturalistic driving studies with computer vision techniques have become an emerging research topic. The objective is to classify distracted driver behaviors. Specifically, this issue is regarded as temporal action localization (TAL) in untrimmed videos, which is a challenging task in the research field of video analysis. In particular, TAL remains one of the most challenging unsolved problems in computer vision, requiring not only the recognition of actions but also the localization of the start and end times of each action. Most state-of-the-art approaches adopt complex architectures, which are expensive to train and have inefficient inference times. In this study, we propose a new framework for untrimmed naturalistic driving videos that utilizes the results of 3D action recognition with video clip classification for short-range temporal and spatial correlation. Then, a simple data-driven post-processing step is presented for long-range temporal correlation in untrimmed videos. The proposed method is evaluated on the AI City Challenge 2022 dataset for Naturalistic Driving Action Recognition. Accordingly, our method achieves the top-1 result on the public leaderboard of the challenge.
No abstract available
Detecting actions in videos has been widely applied in on-device applications, such as cars, robots, etc. Practical on-device videos are always untrimmed with both action and background. It is desirable for a model to both recognize the class of action and localize the temporal position where the action happens. Such a task is called temporal action localization (TAL), which is usually trained on the cloud where multiple untrimmed videos are collected and labeled. It is desirable for a TAL model to continuously and locally learn from new data, which can directly improve the action detection precision while protecting customers' privacy. However, directly training a TAL model on the device is nontrivial. To train a TAL model that can precisely recognize and localize each action, a tremendous number of video samples with temporal annotations are required. However, annotating videos frame by frame is exorbitantly time consuming and expensive. Although weakly supervised temporal action localization (W-TAL) has been proposed to learn from untrimmed videos with only video-level labels, such an approach is also not suitable for on-device learning scenarios. In practical on-device learning applications, data are collected as a stream. For example, the camera on the device keeps collecting video frames for hours or days, and the actions of nearly all classes are included in a single long video stream. Dividing such a long video stream into multiple video segments requires lots of human effort, which hinders the exploration of applying TAL tasks to realistic on-device learning applications. To enable W-TAL models to learn from a long, untrimmed streaming video, we propose an efficient video learning approach that can directly adapt to new environments. We first propose a self-adaptive video dividing approach with a contrast score-based segment merging approach to convert the video stream into multiple segments. Then, we explore different sampling strategies on the TAL tasks to request as few labels as possible. To the best of our knowledge, this is the first attempt to directly learn from an on-device, long video stream. Experimental results on the THUMOS'14 dataset show that the performance of our approach is comparable to the current W-TAL state-of-the-art (SOTA) work without any laborious manual video splitting.
Temporal action localization (TAL) is a fundamental yet challenging task in video understanding. Existing TAL methods rely on pre-training a video encoder through action classification supervision. This results in a task discrepancy problem for the video encoder -- trained for action classification, but used for TAL. Intuitively, end-to-end model optimization is a good solution. However, this is not operable for TAL subject to the GPU memory constraints, due to the prohibitive computational cost in processing long untrimmed videos. In this paper, we resolve this challenge by introducing a novel low-fidelity end-to-end (LoFi) video encoder pre-training method. Instead of always using the full training configurations for TAL learning, we propose to reduce the mini-batch composition in terms of temporal, spatial or spatio-temporal resolution so that end-to-end optimization for the video encoder becomes operable under the memory conditions of a mid-range hardware budget. Crucially, this enables the gradient to flow backward through the video encoder from a TAL loss supervision, favourably solving the task discrepancy problem and providing more effective feature representations. Extensive experiments show that the proposed LoFi pre-training approach can significantly enhance the performance of existing TAL methods. Encouragingly, even with a lightweight ResNet18 based video encoder in a single RGB stream, our method surpasses two-stream ResNet50 based alternatives with expensive optical flow, often by a good margin.
Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective, training-based ZS-TAL approaches assume the availability of labeled data for supervised learning, which can be impractical in some applications. Furthermore, the training process naturally induces a domain bias into the learned model, which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective, relaxing the requirement for training data. To this aim, we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell, T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First, a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then, action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally, frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs, confirming the benefit of a test-time adaptation approach.
Temporal action localization (TAL), which aims to identify and localize actions in long untrimmed videos, is a challenging task in video understanding. Recent studies have shown that the Transformer and its variants are effective at improving the performance of TAL. The success of the Transformer can be attributed to the use of multi-head self-attention (MHSA) as a token mixer to capture long-term temporal dependencies within the video sequence. However, in the existing Transformer architecture, the features obtained by multiple token mixing (i.e., self-attention) heads are treated equally, which neglects the distinct characteristics of different heads and hampers the exploitation of discriminative information. To this end, we present a new method called the adaptive dual selective Transformer (ADSFormer) for TAL in this paper. The key component in ADSFormer is the dual selective multi-head token mixer (DSMHTM), which integrates multiple feature representations from different token mixing heads by adaptively selecting important features across both the head and channel dimensions. Moreover, we also incorporate our ADSFormer into a pyramid structure so that the multi-scale features obtained can be effectively combined to improve TAL performance. Benefiting from the dual selective multi-head token mixer (DSMHTM) and pyramid feature combination, ADSFormer outperforms several state-of-the-art methods on four challenging benchmark datasets: THUMOS14, MultiTHUMOS, EPIC-KITCHENS-100 and ActivityNet-1.3.
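One way to read the dual (head and channel) selection is gating the per-head outputs of the token mixer with learned per-head and per-channel weights before merging them, as in the sketch below; this is an interpretation of the idea with assumed shapes, not the ADSFormer code:

```python
import torch
import torch.nn as nn

class DualSelectiveFusion(nn.Module):
    """Dual selection sketch: gate the per-head outputs of a token mixer with
    learned per-head and per-channel weights before merging them."""
    def __init__(self, num_heads, head_dim):
        super().__init__()
        dim = num_heads * head_dim
        self.head_gate = nn.Sequential(nn.Linear(dim, num_heads), nn.Sigmoid())
        self.chan_gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, head_outputs):              # (B, T, H, D) per-head features
        B, T, H, D = head_outputs.shape
        pooled = head_outputs.reshape(B, T, H * D).mean(dim=1)   # (B, H*D) global descriptor
        hg = self.head_gate(pooled).view(B, 1, H, 1)             # select important heads
        cg = self.chan_gate(pooled).view(B, 1, H, D)             # select important channels
        return (head_outputs * hg * cg).reshape(B, T, H * D)
```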
Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.
Weakly-Supervised Temporal Action Localization (WS-TAL) aims to jointly localize and classify action segments in untrimmed videos with only video-level annotations. To leverage video-level annotations, most existing methods adopt the multiple instance learning paradigm where frame-/snippet-level action predictions are first produced and then aggregated to form a video-level prediction. Although there are trials to improve snippet-level predictions by modeling temporal relationships, we argue that those implementations have not sufficiently exploited such information. In this paper, we propose Multi-Modal Plateau Transformers (M2PT) for WS-TAL by simultaneously exploiting temporal relationships among snippets, complementary information across data modalities, and temporal coherence among consecutive snippets. Specifically, M2PT explores a dual-Transformer architecture for RGB and optical flow modalities, which models intra-modality temporal relationship with a self-attention mechanism and inter-modality temporal relationship with a cross-attention mechanism. To capture the temporal coherence that consecutive snippets are supposed to be assigned with the same action, M2PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that our proposed M2PT achieves state-of-the-art performance.
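The inter-modality cross-attention between RGB and optical-flow snippets can be sketched with plain nn.MultiheadAttention layers; this is a minimal stand-in under assumed shapes, not the M2PT dual-Transformer design:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Inter-modality attention sketch: RGB snippets attend to optical-flow
    snippets and vice versa, with residual connections."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.rgb_to_flow = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_to_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, flow):                        # both (B, T, C)
        rgb_out, _ = self.rgb_to_flow(rgb, flow, flow)   # RGB queries, flow keys/values
        flow_out, _ = self.flow_to_rgb(flow, rgb, rgb)
        return rgb + rgb_out, flow + flow_out
```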
Weakly supervised temporal action localization (TAL) aims to localize the action instances in untrimmed videos using only video-level action labels. Without snippet-level labels, it is hard to assign accurate action/background categories to all snippets. The main difficulties are the large variations brought by unconstrained background snippets and the multiple subactions within action snippets. The existing prototype model focuses on describing snippets by covering them with clusters (defined as prototypes). In this work, we argue that clustered prototypes covering snippets with simple variations still suffer from the misclassification of snippets with large variations. We propose an ensemble prototype network (EPNet), which ensembles prototypes learned with consensus-aware clustering. The network stacks a consensus prototype learning (CPL) module and an ensemble snippet weight learning (ESWL) module as one stage and extends one stage to multiple stages in an ensemble learning manner. The CPL module learns the consensus matrix by estimating the similarity of clustering labels between two successive clustering generations. The consensus matrix optimizes the clustering to learn consensus prototypes, which can predict the snippets with consensus labels. The ESWL module estimates the weights of the misclassified snippets using the snippet-level loss. The weights update the posterior probabilities of the snippets in the clustering to learn prototypes in the next stage. We use multiple stages to learn multiple prototypes, which can cover the snippets with large variations for accurate snippet classification. Extensive experiments show that our method achieves state-of-the-art performance among weakly supervised TAL methods on the THUMOS’14, ActivityNet v1.2, and ActivityNet v1.3 datasets.
Weakly-supervised Temporal Action Localization (W-TAL) aims to train a model to localize all action instances potentially from different classes in an untrimmed video, using a training dataset that has video-level action class labels but has no detailed annotations on the start and end timestamps of action instances. We propose to solve the W-TAL problem from the feature learning aspect, with a new architecture, termed F3-Net, which includes (1) a Feature Weakening (FW) module that can identify and randomly weaken either the most discriminative action or the most discriminative background features over the training iterations to force the network to precisely localize the action instances in both discriminative and ambiguous action-related frames, without spreading to the background intervals; (2) a Feature Contextualization (FC) module that can infer the global contexts among video segments and attentionally fuse them with the local contexts from individual video segments to generate more representative features; and (3) a Feature Discrimination (FD) module that can highlight the most discriminative video segments/classes corresponding to each class/segment, respectively, for localizing multiple action instances from different classes within a video. Experimental results on THUMOS14 and ActivityNet1.3 demonstrate the state-of-the-art performance of our F3-Net, and the FW and FC are also effective plug-in modules to improve other methods. This project will be available at https://moniruzzamanmd.github.io/F3-Net/
Temporal Action Localization (TAL) involves localizing and classifying action snippets in an untrimmed video. The emergence of large video foundation models has led RGB-only video backbones to outperform previous methods needing both RGB and optical flow modalities. Leveraging these large models is often limited to training only the TAL head due to the prohibitively large GPU memory required to adapt the video backbone for TAL. To overcome this limitation, we introduce LoSA, the first memory-and-parameter-efficient backbone adapter designed specifically for TAL to handle untrimmed videos. LoSA specializes for TAL by introducing Long-Short-range Adapters that adapt the intermediate layers of the video backbone over different temporal ranges. These adapters run parallel to the video backbone to significantly reduce memory footprint. LoSA also includes Long-Short-range Gated Fusion that strategically combines the output of these adapters from the video backbone layers to enhance the video features provided to the TAL head. Experiments show that LoSA significantly outperforms all existing methods on standard TAL benchmarks, THUMOS-14 and ActivityNet-v1.3, by scaling end-to-end backbone adaptation to billion-parameter-plus models like VideoMAEv2 (ViT-g) and leveraging them beyond head-only transfer learning.
To deal with the great number of untrimmed videos produced every day, we propose an efficient unsupervised action segmentation method by detecting boundaries, named action boundary detection (ABD). In particular, the proposed method has the following advantages: no training stage and low-latency inference. To detect action boundaries, we estimate the similarities across smoothed frames, which inherently have the properties of internal consistency within actions and external discrepancy across actions. Under this circumstance, we successfully transfer the boundary detection task into the change point detection based on the similarity. Then, non-maximum suppression (NMS) is conducted in local windows to select the smallest points as candidate boundaries. In addition, a clustering algorithm is followed to refine the initial proposals. Moreover, we also extend ABD to the online setting, which enables real-time action segmentation in long untrimmed videos. By evaluating on four challenging datasets, our method achieves state-of-the-art performance. Moreover, thanks to the efficiency of ABD, we achieve the best trade-off between the accuracy and the inference time compared with existing unsupervised approaches.
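The training-free boundary detection recipe described above (smooth frames, measure adjacent-frame similarity, keep local similarity minima under NMS) can be sketched as follows; the window sizes and smoothing choice are illustrative assumptions:

```python
import numpy as np

def action_boundary_detection(features, window=5, nms_radius=8):
    """Training-free boundary detection sketch: smooth frame features, measure
    similarity between consecutive frames, and keep local similarity minima
    (change points) as candidate action boundaries. features: (T, C)."""
    kernel = np.ones(window) / window
    smoothed = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, features)
    a, b = smoothed[:-1], smoothed[1:]
    sim = (a * b).sum(1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8)
    boundaries = []
    for t in range(len(sim)):                      # NMS: keep the smallest point in each window
        lo, hi = max(0, t - nms_radius), min(len(sim), t + nms_radius + 1)
        if sim[t] == sim[lo:hi].min():
            boundaries.append(t + 1)
    return boundaries
```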
It is challenging to generate temporal action proposals from untrimmed videos. In general, boundary-based temporal action proposal generators are based on detecting temporal action boundaries, where a classifier is usually applied to evaluate the probability of each temporal action location. However, most existing approaches treat boundaries and contents separately, which neglect that the context of actions and the temporal locations complement each other, resulting in incomplete modeling of boundaries and contents. In addition, temporal boundaries are often located by exploiting either local clues or global information, without mining local temporal information and temporal-to-temporal relations sufficiently at different levels. Facing these challenges, a novel approach named multi-level content-aware boundary detection (MCBD) is proposed to generate temporal action proposals from videos, which jointly models the boundaries and contents of actions and captures multi-level (i.e., frame level and proposal level) temporal and context information. Specifically, the proposed MCBD preliminarily mines rich frame-level features to generate one-dimensional probability sequences, and further exploits temporal-to-temporal proposal-level relations to produce two-dimensional probability maps. The final temporal action proposals are obtained by a fusion of the multi-level boundary and content probabilities, achieving precise boundaries and reliable confidence of proposals. The extensive experiments on the three benchmark datasets of THUMOS14, ActivityNet v1.3 and HACS demonstrate the effectiveness of the proposed MCBD compared to state-of-the-art methods. The source code of this work can be found in https://mic.tongji.edu.cn.
Temporal action segmentation is typically achieved by discovering the dramatic variances in global visual descriptors. In this paper, we explore the merits of local features by proposing the unsupervised framework of Object-centric Temporal Action Segmentation (OTAS). Broadly speaking, OTAS consists of self-supervised global and local feature extraction modules as well as a boundary selection module that fuses the features and detects salient boundaries for action segmentation. As a second contribution, we discuss the pros and cons of existing frame-level and boundary-level evaluation metrics. Through extensive experiments, we find OTAS is superior to the previous state-of-the-art method by 41% on average in terms of our recommended F1 score. Surprisingly, OTAS even outperforms the ground-truth human annotations in the user study. Moreover, OTAS is efficient enough to allow real-time inference.
The Online Detection of Action Start (ODAS) has attracted the attention of researchers because of its practical applications in areas such as security and emergency response. However, online detection of activity boundaries remains a challenging task due to the inherent ambiguity of boundary definition and the significant imbalance in the number of boundaries and nonboundary points. To address this issue, this study proposes a novel Distribution-aware Activity Boundary Representation (DABR) method that utilizes a continuous probability density function to smooth the probability of moments near activity boundaries. The proposed DABR reduces the penalty for detecting moments near ground-truth boundary points, while increasing the number of samples related to boundary points. Additionally, we introduce a two-stage framework that incorporates class-informed information in temporal localization for more efficient activity boundary localization. Extensive experiments demonstrate that our method achieves state-of-the-art results on two standard datasets, particularly exhibiting a significant improvement of 11.5% at average p-mAP on the THUMOS'14 dataset.
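The distribution-aware boundary representation can be approximated by replacing hard 0/1 boundary labels with a Gaussian-smoothed target; sigma is an assumed hyper-parameter and the paper's probability density function may differ:

```python
import numpy as np

def soft_boundary_targets(num_frames, boundary_frames, sigma=2.0):
    """Distribution-aware boundary target sketch: replace hard 0/1 labels at
    ground-truth boundaries with a Gaussian that spreads probability mass over
    nearby frames, reducing the penalty for near-miss predictions."""
    t = np.arange(num_frames, dtype=np.float32)
    target = np.zeros(num_frames, dtype=np.float32)
    for b in boundary_frames:
        target = np.maximum(target, np.exp(-((t - b) ** 2) / (2 * sigma ** 2)))
    return target

# soft_boundary_targets(20, [5, 14]) peaks at frames 5 and 14 and decays smoothly around them
```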
Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. Large temporal scale variation of actions is one of the most primary difficulties in TAD. Naturally, multi-scale features have potential in localizing actions of diverse lengths as widely used in object detection. Nevertheless, unlike objects in images, actions have more ambiguity in their boundaries. That is, small neighboring objects are not considered as a large one while short adjoining actions can be misunderstood as a long one. In the coarse-to-fine feature pyramid via pooling, these vague action boundaries can fade out, which we call 'vanishing boundary problem'. To this end, we propose Boundary-Recovering Network (BRN) to address the vanishing boundary problem. BRN constructs scale-time features by introducing a new axis called scale dimension by interpolating multi-scale features to the same temporal length. On top of scale-time features, scale-time blocks learn to exchange features across scale levels, which can effectively settle down the issue. Our extensive experiments demonstrate that our model outperforms the state-of-the-art on the two challenging benchmarks, ActivityNet-v1.3 and THUMOS14, with remarkably reduced degree of the vanishing boundary problem.
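Constructing scale-time features by interpolating every pyramid level to a common temporal length and stacking a new scale axis can be sketched as below; per-level shapes of (B, C, T_l) and alignment to the finest level are assumptions:

```python
import torch
import torch.nn.functional as F

def build_scale_time_features(pyramid, target_len=None):
    """Scale-time feature sketch: interpolate every pyramid level to a common
    temporal length and stack the levels along a new scale axis, so later
    blocks can exchange information across scales at each time step."""
    target_len = target_len or pyramid[0].shape[-1]        # align to the finest level
    aligned = [F.interpolate(f, size=target_len, mode="linear", align_corners=False)
               for f in pyramid]
    return torch.stack(aligned, dim=2)                     # (B, C, S, T) scale-time tensor
```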
No abstract available
Temporal action detection aims to recognize the action category and determine each action instance's starting and ending time in untrimmed videos. The mixed method has demonstrated notable performance by integrating both anchor-based and anchor-free approaches. However, while it leverages the strengths of each method, it also retains their respective limitations. For instance, the anchor-based approach depends on manually crafted anchors tailored to specific datasets, while the anchor-free approach predicts potential action instances at each temporal position, resulting in a significant number of false positives in category prediction. The inclusion of these limitations undermines the potential benefits of the mixed method. In this paper, we propose a novel Boundary Discretization and Reliable Classification Network (BDRC-Net) that addresses the issues above by introducing boundary discretization and reliable classification modules. Specifically, the boundary discretization module (BDM) elegantly merges anchor-based and anchor-free approaches in the form of boundary discretization, eliminating the need for the traditional handcrafted anchor design. Furthermore, the reliable classification module (RCM) predicts reliable global action categories to reduce false positives. Extensive experiments conducted on different benchmarks demonstrate that our proposed method achieves competitive detection performance.
Temporal action detection is a challenging task in video understanding, due to the complex background and rich action content that hinder high-quality temporal proposal generation in untrimmed videos. Capsule networks can avoid some limitations of convolutional neural networks, such as the invariance caused by pooling, and can better model the temporal relations needed for temporal action detection. However, because of their extremely expensive computation, capsule networks are difficult to apply to temporal action detection. To address this issue, this paper proposes a novel U-shaped capsule network framework with a k-Nearest Neighbor (k-NN) mechanism of 3D convolutional dynamic routing, which we name U-BlockConvCaps. Furthermore, we build a Capsules Boundary Network (CapsBoundNet) based on U-BlockConvCaps for dense temporal action proposal generation. Specifically, the first module is a 1D convolutional layer that fuses the two-stream RGB and optical flow video features. The sampling module further processes the fused features to generate the 2D start-end action proposal feature maps. Then, the multi-scale U-Block convolutional capsule module with 3D convolutional dynamic routing is used to process the proposal feature map. Finally, the feature maps generated from the CapsBoundNet are used to predict starting, ending, action classification, and action regression score maps, which help to capture the boundary and intersection-over-union features. Our work innovatively improves the dynamic routing algorithm of capsule networks and extends the use of capsule networks to the temporal action detection task for the first time in the literature. Experimental results on the THUMOS14 benchmark show that the performance of CapsBoundNet clearly surpasses state-of-the-art methods, e.g., mAP at tIoU = 0.3, 0.4, and 0.5 on THUMOS14 is improved from 63.6% to 70.0%, 57.8% to 63.1%, and 51.3% to 52.9%, respectively. We also obtain competitive results on the ActivityNet1.3 action detection dataset.
Accurately detecting the start and end boundaries of movements and identifying them in video recognition has always been challenging, especially for dance movement detection due to its high complexity. This paper proposes a deep learning framework for effective feature extraction and movement evaluation. First, we construct a dataset of five modern dance movements and introduce expert knowledge to build the keyframe annotations. Second, we utilize a dual-stream network to capture the features of movements in dance videos. We then annotate the start, key, and end frames of each dance movement to strengthen supervision. Finally, the movement identification module combines the extracted action features with the keyframe features. The experimental results achieve a maximum recall of 89% in the boundary segmentation of dance videos and a maximum accuracy of 65% in dance movement identification, showing that the framework can realize highly accurate dance movement recognition and boundary detection, which provides strong support for artistic creation, teaching and performance in the field of dance.
No abstract available
No abstract available
The task of temporal action detection aims to locate and classify action segments in untrimmed videos. Most existing works consist of two components: snippet-level boundary segmentation and anchor-level action evaluation. These two components, however, are typically designed independently of each other, so detection accuracy is undermined by vague boundaries and complex video content. To tackle this problem, we design two complementary modules. One module, termed the Anchor Aware Module (AAM), uses temporally and semantically related anchors to enhance snippet features. The other, named the Boundary Aware Module (BAM), endows anchor features with a structured representation using intermediate supervision. Moreover, a ConvLSTM is applied in BAM to establish temporal relations over the structured representation. These two modules are integrated as the Boundary-Anchor Complementary Network (BACNet), which achieves state-of-the-art performance on both the THUMOS-14 and ActivityNet-1.3 datasets.
Fine-grained temporal action detection aims at predicting the categories and locating the boundaries of fine-grained action instances in long, untrimmed videos. The fine-grained classification of action instances brings new challenges to temporal action detection, which changes the distribution of action instances of different durations and increases the proportion of short action instances. However, the existing anchor-free detection methods cannot fully utilize global information and local information. Therefore, this paper proposes an anchor-free temporal action detection method with global feature enhancement and local boundary adjustment. Based on the feature pyramid, the attention mechanism of transformer is used to model long-range temporal dependencies between features at different locations of the same level and introduce global information from the upper-level of the feature pyramid to generate coarse predictions. To obtain local details, the interaction with the low-level feature is used to adjust the boundaries of coarse predictions. Experiments on FineAction demonstrate the effectiveness of this method.
Temporal action proposal generation is a fundamental yet challenging task that locates temporal actions in untrimmed videos. Although current proposal generation methods can produce precise action boundaries, few consider the relations between proposals. In this paper, we propose a unified framework, the Boundary Graph Convolutional Network (BGCN), which generates temporal boundary proposals with a graph convolutional network built on the boundary proposals' features. BGCN draws inspiration from boundary-based methods and applies edge graph convolution to the boundary proposals' features. First, a base layer fuses the two-stream video features into two branches of base features. The two branches then enter the same Proposal Features Graph Convolutional Network (PFGCN) structure: an Action PFGCN that extracts the action classification score and a Boundary PFGCN that extracts the starting and ending scores. Within the PFGCN, proposal features are first densely sampled from the video features. We then construct a proposal feature graph in which each proposal feature is a node, the relations between proposal features are edges, and edge convolution is used for graph convolution. Afterwards, the relations are mapped into a 2D score map. Experiments on the popular THUMOS14 benchmark demonstrate the superiority of BGCN over state-of-the-art proposal generators (e.g., G-TAD, TAL-Net, and BMN) at all tIoU thresholds from 0.3 to 0.7 (44.8% versus 42.8% at tIoU 0.5). On ActivityNet1.3, BGCN also obtains better results. Moreover, BGCN is highly efficient for action detection, with a model size of less than 2 MB and fast inference time. Highlights: a GCN built on boundary generation that densely produces action proposals; an efficient and novel BGCN model with a strong capability to learn proposal features; a small model size and fast inference time for temporal action proposal generation.
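As a hedged illustration of the edge-convolution idea described above (nothing below is taken from the BGCN code; the k-nearest-neighbour graph and MLP are assumptions), a proposal feature graph with an EdgeConv-style update might look like this:

```python
import torch
import torch.nn as nn

class ProposalEdgeConv(nn.Module):
    """Illustrative sketch (not the authors' code): treat each candidate proposal
    feature as a graph node, connect every node to its k most similar neighbours,
    and update nodes with an EdgeConv-style MLP over (node, neighbour - node)."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(inplace=True))

    def forward(self, props: torch.Tensor) -> torch.Tensor:
        # props: (N, C) features of N sampled proposals from one video
        sim = props @ props.t()                            # (N, N) similarity
        idx = sim.topk(self.k + 1, dim=1).indices[:, 1:]   # drop self-edges
        neighbours = props[idx]                            # (N, k, C)
        center = props.unsqueeze(1).expand_as(neighbours)
        edge_feat = torch.cat([center, neighbours - center], dim=-1)  # (N, k, 2C)
        updated = self.mlp(edge_feat).max(dim=1).values               # (N, C)
        return updated
```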
No abstract available
No abstract available
Temporal action localization (TAL) is crucial in video analysis, yet presents notable challenges. This process focuses on the precise identification and categorization of action instances within lengthy, raw videos. A key difficulty in TAL lies in determining the exact start and end points of actions, owing to the often unclear boundaries of these actions in real-world footage. Existing methods tend to take insufficient account of changes in action boundary features. To tackle these issues, we propose a boundary awareness network (BAN) for TAL. Specifically, the BAN mainly consists of a feature encoding network, coarse pyramidal detection to obtain preliminary proposals and action categories, and fine-grained detection with a Gaussian boundary module (GBM) to get more valuable boundary information. The GBM contains a novel Gaussian boundary pooling, which serves to aggregate the relevant features of the action boundaries and to capture discriminative boundary and actionness features. Furthermore, we introduce a novel approach named Boundary Differentiated Learning (BDL) to ensure our model’s capability in accurately identifying action boundaries across diverse proposals. Comprehensive experiments on both the THUMOS14 and ActivityNet v1.3 datasets, where our BAN model achieved an increase in mean Average Precision (mAP) by 1.6% and 0.2%, respectively, over existing state-of-the-art methods, illustrate that our approach not only improves upon the current state of the art but also achieves outstanding performance.
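The abstract only names "Gaussian boundary pooling" without details. A plausible minimal reading, sketched below purely as an assumption, is to aggregate snippet features with Gaussian weights centred on a predicted boundary location:

```python
import torch

def gaussian_boundary_pool(feats: torch.Tensor, centers: torch.Tensor,
                           sigma: float = 2.0) -> torch.Tensor:
    """Hypothetical sketch of Gaussian-weighted pooling around boundaries.
    feats:   (B, C, T) snippet features
    centers: (B,) predicted boundary positions (snippet indices, float)
    returns: (B, C) boundary descriptors aggregated with Gaussian weights."""
    B, C, T = feats.shape
    t = torch.arange(T, device=feats.device, dtype=feats.dtype)            # (T,)
    # Gaussian weight of every snippet w.r.t. its video's boundary centre
    w = torch.exp(-0.5 * ((t.unsqueeze(0) - centers.unsqueeze(1)) / sigma) ** 2)
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-6)                     # (B, T)
    return torch.einsum("bct,bt->bc", feats, w)
```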
Temporal action detection (TAD) is a challenging task in the field of video understanding. It requires determining the semantic labels and precise boundaries of each action instance in an untrimmed video. Over the years, a variety of networks, including convolutional, graph, and transformer architectures, have been effectively applied to TAD. Most methods can identify the action category well; however, the accuracy of determining action boundaries is still insufficient. Because an action contains several consecutive frames of similar images, we propose picking out the key frames in the video sequence and enhancing the TAD representation by extracting additional key-frame features. We propose KeyMamba, a state-space-model-based learnable network for TAD tasks. The proposed model applies a bidirectional Mamba block to capture global features efficiently. We also add a temporal deformable attention module to extract key-frame features from video clips. These features capture motion changes, and the key-frame features complement the global features, allowing video action boundaries to be identified more accurately. In addition, to obtain higher-quality tokens in the spatial dimension, we add an attention mask before the bidirectional Mamba block encoder. Finally, we also apply masking operations during the forward and backward scanning processes within the bidirectional Mamba block to mitigate the impact of duplicate tokens. Our experiments achieve outstanding performance on the THUMOS14 and ActivityNet-1.3 datasets, reaching an average mAP of 70.4 on THUMOS14 and 38.44 on ActivityNet-1.3.
End-to-end Temporal Action Detection (TAD) has achieved remarkable progress in recent years, driven by innovations in model architectures and the emergence of Video Foundation Models (VFMs). However, existing TAD methods that perform full fine-tuning of pretrained video models often incur substantial computational costs, which become particularly pronounced when processing long video sequences. Moreover, the need for precise temporal boundary annotations makes data labeling extremely expensive. In low-resource settings where annotated samples are scarce, direct fine-tuning tends to cause overfitting. To address these challenges, we introduce Dynamic Low-Rank Adapter (DyLoRA), a lightweight fine-tuning framework tailored specifically for the TAD task. Built upon the Low-Rank Adaptation (LoRA) architecture, DyLoRA adapts only the key layers of the pretrained model via low-rank decomposition, reducing the number of trainable parameters to less than 5% of full fine-tuning methods. This significantly lowers memory consumption and mitigates overfitting in low-resource settings. Notably, DyLoRA enhances the temporal modeling capability of pretrained models by optimizing temporal dimension weights, thereby alleviating the representation misalignment of temporal features. Experimental results demonstrate that DyLoRA-TAD achieves impressive performance, with 73.9% mAP on THUMOS14, 39.52% on ActivityNet-1.3, and 28.2% on Charades, substantially surpassing the best traditional feature-based methods.
No abstract available
Temporal action detection (TAD), which locates and recognizes action segments, remains a challenging task in video understanding due to variable segment lengths and ambiguous boundaries. Existing methods treat neighboring contexts of an action segment indiscriminately, leading to imprecise boundary predictions. We introduce a single-stage ContextDet framework, which makes use of large-kernel convolutions in TAD for the first time. Our model features a pyramid adaptive context aggregation (ACA) architecture, capturing long context and improving action discriminability. Each ACA level consists of two novel modules. The context attention module (CAM) identifies salient contextual information, encourages context diversity, and preserves context integrity through a context gating block (CGB). The long context module (LCM) makes use of a mixture of large- and small-kernel convolutions to adaptively gather long-range context and fine-grained local features. Additionally, by varying the length of these large kernels across the ACA pyramid, our model provides lightweight yet effective context aggregation and action discrimination. We conducted extensive experiments and compared our model with a number of advanced TAD methods on six challenging TAD benchmarks: MultiThumos, Charades, FineAction, EPIC-Kitchens 100, Thumos14, and HACS, demonstrating superior accuracy at reduced inference cost.
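As a sketch of the large-/small-kernel mixture idea (kernel sizes, depthwise convolutions, and the sigmoid gate are assumptions, not ContextDet's actual modules):

```python
import torch
import torch.nn as nn

class LongContextMixer(nn.Module):
    """Illustrative mixture of a large-kernel and a small-kernel 1D convolution,
    blended by a learned per-position gate (a sketch, not the ContextDet code)."""

    def __init__(self, dim: int, large_k: int = 31, small_k: int = 3):
        super().__init__()
        self.large = nn.Conv1d(dim, dim, large_k, padding=large_k // 2, groups=dim)
        self.small = nn.Conv1d(dim, dim, small_k, padding=small_k // 2, groups=dim)
        self.gate = nn.Sequential(nn.Conv1d(dim, dim, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T); the gate decides, per channel and position, how much
        # long-range context vs. fine-grained local detail to keep
        g = self.gate(x)
        return g * self.large(x) + (1.0 - g) * self.small(x)
```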
Temporal Action Detection (TAD), the task of localizing and classifying actions in untrimmed video, remains challenging due to action overlaps and variable action durations. Recent findings suggest that TAD performance depends more on the structural design of transformers than on the self-attention mechanism itself. Building on this insight, we propose a refined feature extraction process through lightweight yet effective operations. First, we employ a local branch that uses parallel convolutions with varying window sizes to capture both fine-grained and coarse-grained temporal features; this branch incorporates a gating mechanism to select the most relevant features. Second, we introduce a context branch that uses boundary frames as key-value pairs to analyze their relationship with the central frame through cross-attention. The proposed method captures temporal dependencies and improves contextual understanding. Evaluations of the gating mechanism and context branch on challenging datasets (THUMOS14 and EPIC-KITCHENS 100) show a consistent improvement over the baseline and existing methods.
Online detection of action start is a significant and challenging task that requires prompt identification of action start positions and corresponding categories within streaming videos. This task presents challenges due to data imbalance, similarity in boundary content, and real‐time detection requirements. Here, a novel Time‐Attentive Fusion Network is introduced to address the requirements of improved action detection accuracy and operational efficiency. The time‐attentive fusion module is proposed, which consists of long‐term memory attention and the fusion feature learning mechanism, to improve spatial‐temporal feature learning. The temporal memory attention mechanism captures more effective temporal dependencies by employing weighted linear attention. The fusion feature learning mechanism facilitates the incorporation of current moment action information with historical data, thus enhancing the representation. The proposed method exhibits linear complexity and parallelism, enabling rapid training and inference speed. This method is evaluated on two challenging datasets: THUMOS’14 and ActivityNet v1.3. The experimental results demonstrate that the proposed method significantly outperforms existing state‐of‐the‐art methods in terms of both detection accuracy and inference speed.
Boundary localization is a challenging problem in Temporal Action Detection (TAD), in which there are two main issues. First, the submergence of movement feature, i.e. the movement information in a snippet is covered by the scene information. Second, the scale of action, that is, the proportion of action segments in the entire video, is considerably variable. In this work, we first design a Movement Enhance Module (MEM) to highlight movement feature for better action location, and then, we propose a Scale Feature Pyramid Network (SFPN) to detect multi-scale actions in videos. For Movement Enhance Module, firstly, Movement Feature Extractor (MFE) is designed to get the movement feature. Secondly, we propose a Multi-Relation Enhance Module (MREM) to grasp valuable information correlation both locally and temporally. For Scale Feature Pyramid Network, we design a U-Shape Module to model different scale actions, moreover, we design the training and inference strategy of different scales, ensuring that each pyramid layer is only responsible for actions at a specific scale. These two innovations are integrated as the Movement Enhance Network (MENet), and extensive experiments conducted on two challenging benchmarks demonstrate its effectiveness. MENet outperforms other representative TAD methods on ActivityNet-1.3 and THUMOS-14.
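A minimal way to "highlight movement features" when scene appearance dominates is a temporal difference of adjacent snippet features; the sketch below is an interpretation of that idea, not MENet's actual MFE:

```python
import torch
import torch.nn as nn

class MovementFeatureSketch(nn.Module):
    """Sketch (an assumption, not MENet's exact MFE): expose movement by taking
    differences between adjacent snippet features and projecting them back."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) snippet features dominated by scene appearance
        diff = x[:, :, 1:] - x[:, :, :-1]           # snippet-to-snippet change
        diff = nn.functional.pad(diff, (1, 0))      # keep temporal length T
        return x + self.proj(diff)                  # scene feature + movement cue
```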
Cricket Bowl Release Detection aims to segment specific portions of bowl release actions occurring in multiple videos, with a focus on detecting the entire time window of this action. Unlike traditional detection tasks that identify action categories at a specific moment, this task involves identifying events that typically span around 100 frames and require recognizing all instances of the bowl release action in the video. Strictly speaking, this task falls under a branch of temporal action detection. With the advancement of deep neural networks, recent works have proposed deep learning-based approaches to address this task. However, due to the challenge of unclear action boundaries in videos, many existing methods perform poorly on the DeepSportradar Cricket Bowl Release Dataset. To more accurately identify specific portions of the bowl release action in videos, we adopt a one-stage architecture based on Relative Boundary Modeling. Specifically, our method consists of three stages. In the first stage, we use the Inflated 3D ConvNet (I3D) model to extract spatio-temporal features from the input videos. In the second stage, we utilize Temporal Action Detection with Relative Boundary Modeling (TriDet) to model the boundaries of the bowl release action's specific portions based on the relative relationships between different time moments, thereby predicting the action's time window. Lastly, as the target events typically span around 100 frames and the predicted time windows may exhibit overlapping regions based on confidence scores, we implement a post-processing step to merge and filter these outputs, resulting in the final submission results. We conducted extensive experiments to demonstrate that our proposed method achieves superior performance. Additionally, we evaluated the training techniques of existing approaches. Our proposed method achieves a PQ score of 0.519, an SQ score of 0.822, and an RQ score of 0.632 on the challenge set of the DeepSportradar Cricket Bowl Release Dataset. Through this approach, our team, USTC_IAT_United, won the third place in the first phase of the DeepSportradar Cricket Bowl Release Challenge.
Temporal Action Detection (TAD) aims to identify action boundaries and their corresponding categories in untrimmed videos, playing a crucial role in long-video understanding. Prior works often struggle to balance the trade-off between capturing long-range dependencies and ensuring computational efficiency. Recently, the state space model Mamba has exhibited impressive capabilities and efficiency in long-term sequence modeling. However, current methods based on Mamba generally lack a unified framework to simultaneously address the redundancy of long-duration actions and the boundary sensitivity of short-duration actions—limitations that largely stem from Mamba’s reliance on limited state representations and its unidirectional modeling. To tackle the aforementioned challenges, we propose DilatedTAD, a novel TAD framework with an expanded receptive field. DilatedTAD leverages the Inter-Parallel DIM component (InterDIM) to integrate multi-scale temporal information, enabling a better trade-off between short-duration and long-duration action detection. InterDIM is built upon our proposed Dilated Mamba (DIM), where multiple DIM branches with different dilation rates are designed to focus on actions of varying durations. Specifically, DIM introduces a novel use of dilation to skip redundant temporal information, thereby enhancing the model’s focus on crucial boundary features. Additionally, a bidirectional modeling design is adopted in DIM to compensate for the lack of future temporal context in the original Mamba architecture. Extensive experiments show that DilatedTAD outperforms state-of-the-art methods on multiple datasets, achieving mAPs of 74.9% (THUMOS14), 42.90% (ActivityNet 1.3), 45.0% (HACS), and 26.3% and 24.3% (EPIC-Kitchens 100). Our code will be publicly available.
Temporal action proposal generation plays a vital role in the analysis of untrimmed videos and has garnered growing interest from researchers. Nevertheless, the presence of long-term temporal dependencies and the large variation in action durations within untrimmed videos pose significant challenges for accurately localizing action boundaries. To overcome the aforementioned issues, we design a novel Transformer-based Temporal Feature Pyramid Network (TTFPN) tailored for generating action proposals. Specifically, we introduce a local transformer to capture long-term temporal information while reducing computational complexity through the substitution of conventional self-attention with a localized variant. Subsequently, a temporal feature pyramid is built to produce multi-scale representations, enabling the model to effectively handle action instances of varying durations. Based on this temporal feature pyramid, we employ a convolutional network-based predictor to generate action proposals in an anchor-free manner. We evaluate TTFPN on THUMOS14, a standard benchmark for temporal action detection, to validate its effectiveness. The results show that TTFPN achieves competitive performance and significantly outperforms previous methods.
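The "localized variant" of self-attention can be illustrated by restricting attention to non-overlapping temporal windows; the window size and single-layer layout below are assumptions rather than TTFPN's exact design:

```python
import torch
import torch.nn as nn

class WindowedSelfAttention(nn.Module):
    """Sketch of local self-attention over non-overlapping temporal windows
    (window size and head count are illustrative assumptions)."""

    def __init__(self, dim: int, window: int = 16, heads: int = 4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, C); T is assumed to be padded to a multiple of `window`
        B, T, C = x.shape
        w = self.window
        x = x.reshape(B * T // w, w, C)    # each window attends only to itself
        out, _ = self.attn(x, x, x)
        return out.reshape(B, T, C)
```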
As the cornerstone of human-behavior analysis in video understanding, temporal action proposal generation aims to predict the starting and ending time of human action instances in untrimmed videos. Although large achievements in temporal action proposal generation have been achieved, most previous studies ignore the variability of action frequency in raw videos, leading to unsatisfying performances on high-action-frequency videos. In fact, there exists two main issues which should be well addressed: data imbalance between high and low action-frequency videos, and inferior detection of short actions in high-action-frequency videos. To address the above issues, we propose an effective framework by adapting to the variability of action frequency, namely Action Frequency Adaptive Network (AFAN), which can be flexibly built upon any temporal action proposal generation method. AFAN consists of two modules: Learning From Experts (LFE) and Fine-Grained Processing (FGP). The LFE first trains a series of action proposal generators on different subsets of imbalanced data as experts and then teaches a unified student model via knowledge distillation. To better detect short actions, FGP first finds out high-action-frequency videos and then performs fine-grained detection. Extensive experimental results on four benchmark datasets (ActivityNet-1.3, HACS, THUMOS14 and FineAction) demonstrate the effectiveness and generalizability of the proposed AFAN, especially for high-action-frequency videos.
Temporal action proposal generation (TAPG) aims to locate action instances in untrimmed videos for video analysis tasks. In this paper, we propose a novel approach called Internal Location Assistance Net (ILAN) that takes advantage of internal action points instead of utilizing only the start and end points themselves. Specifically, besides predicting the action's start and end positions, we also predict extra internal points, which could be the one-eighth, one-fourth, or center points, etc. The pairs of left and right internal positions of the action are then matched to generate center-region proposals, and the predicted center region is constrained to lie within the predicted overall action region. Both the confidence of the overall action region and that of the center region are combined to obtain the final action proposal. Besides, we incorporate a window transformer to enhance feature extraction for capturing more precise action boundaries. Extensive experiments are conducted on two popular benchmark datasets: THUMOS14 and ActivityNet-v1.3. The experimental results demonstrate that the proposed method outperforms state-of-the-art methods.
Temporal action proposal generation is a method for extracting temporal action instances or proposals from untrimmed videos. Existing methods often struggle to segment contiguous action proposals, which are a group of action boundaries with small temporal gaps. To address this limitation, we propose incorporating an attention mechanism to weigh the importance of each proposal within a contiguous group. This mechanism leverages the gap displacement between proposals to calculate attention scores, enabling a more accurate localization of action boundaries. We evaluate our method against a state-of-the-art boundary-based baseline on ActivityNet v1.3 and Thumos 2014 datasets. The experimental results demonstrate that our approach significantly improves the performance of short-duration and contiguous action proposals, achieving an average recall of 78.22%.
Temporal action proposal generation (TAPG) is a fundamental and challenging task in media interpretation and video understanding, especially in temporal action detection. Most previous works focus on capturing the local temporal context and can locate simple action instances with clean frames and clear boundaries well. However, they generally fail in complicated scenarios where the actions of interest involve irrelevant frames and background clutter, and the local temporal context becomes less effective. To deal with these problems, we present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG. Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer, which improves its ability to capture long-range dependencies and learn robust features for noisy action instances. Moreover, an adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and differences between adjacent features. The features from the two modules carry rich semantic information about the video and are fused for effective sequential proposal generation. Extensive experiments are conducted on two challenging datasets, THUMOS14 and ActivityNet1.3, and the results demonstrate that our method outperforms state-of-the-art TAPG methods. Our code will be released soon.
No abstract available
Transformer networks are effective at modeling long-range contextual information and have recently demonstrated exemplary performance in the natural language processing domain. Conventionally, the temporal action proposal generation (TAPG) task is divided into two main sub-tasks: boundary prediction and proposal confidence prediction, which rely on the frame-level dependencies and proposal-level relationships separately. To capture the dependencies at different levels of granularity, this paper intuitively presents a unified temporal action proposal generation framework with original Transformers, called TAPG Transformer, which consists of a Boundary Transformer and a Proposal Transformer. Specifically, the Boundary Transformer captures long-term temporal dependencies to predict precise boundary information and the Proposal Transformer learns the rich inter-proposal relationships for reliable confidence evaluation. Extensive experiments are conducted on two popular benchmarks: ActivityNet-1.3 and THUMOS14, and the results demonstrate that TAPG Transformer outperforms state-of-the-art methods. Equipped with the existing action classifier, our method achieves remarkable performance on the temporal action localization task. Codes and models will be available.
By conditioning on unit-level predictions, anchor-free models for action proposal generation have displayed impressive capabilities, such as having a lightweight architecture. However, task performance depends significantly on the quality of data used in training, and most effective models have relied on human-annotated data. Semi-supervised learning, i.e., jointly training deep neural networks with a labeled dataset as well as an unlabeled dataset, has made significant progress recently. Existing works have either primarily focused on classification tasks, which may require less annotation effort, or considered anchor-based detection models. Inspired by recent advances in semi-supervised methods on anchor-free object detectors, we propose a teacher-student framework for a two-stage action detection pipeline, named Temporal Teacher with Masked Transformers (TTMT), to generate high-quality action proposals based on an anchor-free transformer model. Leveraging consistency learning as one self-training technique, the model jointly trains an anchor-free student model and a gradually progressing teacher counterpart in a mutually beneficial manner. As the core model, we design a Transformer-based anchor-free model to improve effectiveness for temporal evaluation. We integrate bi-directional masks and devise encoder-only Masked Transformers for sequences. Jointly training on boundary locations and various local snippet-based features, our model predicts via the proposed scoring function for generating proposal candidates. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of our model for temporal proposal generation task.
Temporal Action Proposal Generation (TAPG) is a challenging task that aims to accurately generate temporal proposals likely to contain human actions in an untrimmed video. Inspired by the successful pre-training strategy applied by transformer-based image object detection methods, we propose a Self-Supervised Pre-Training transformer (SSPT) method for the TAPG task to train an action locator without any labels. As far as we know, this is the first work to explore self-supervised pre-training of transformers for TAPG. Specifically, the pre-training strategy leverages the transformer to capture contextual semantic information and global temporal dependencies of actions across the entire video, which better serves the TAPG task. In the pre-training process, we adopt Random Query Segments, which randomly crops multiple segments from the original video. We treat the segments as pseudo-labels and input them as queries to the transformer decoder. The pretext task we design is to locate the start and end times of the pseudo-labels in the original video. Extensive experiments on THUMOS14 demonstrate the effectiveness of SSPT; the results show that it improves the baseline and leads to higher recall for the temporal action proposal generation task.
Temporal action proposal generation (TAPG) is a challenging task that aims to locate action instances in untrimmed videos with temporal boundaries. To evaluate the confidence of proposals, the existing works typically predict action score of proposals that are supervised by the temporal Intersection-over-Union (tIoU) between proposal and the ground-truth. In this paper, we innovatively propose a general auxiliary Background Constraint idea to further suppress low-quality proposals, by utilizing the background prediction score to restrict the confidence of proposals. In this way, the Background Constraint concept can be easily plug-and-played into existing TAPG methods (BMN, GTAD). From this perspective, we propose the Background Constraint Network (BCNet) to further take advantage of the rich information of action and background. Specifically, we introduce an Action-Background Interaction module for reliable confidence evaluation, which models the inconsistency between action and background by attention mechanisms at the frame and clip levels. Extensive experiments are conducted on two popular benchmarks, ActivityNet-1.3 and THUMOS14. The results demonstrate that our method outperforms state-of-the-art methods. Equipped with the existing action classifier, our method also achieves remarkable performance on the temporal action localization task.
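The plug-and-play Background Constraint idea can be summarized as down-weighting a proposal's confidence by the background probability predicted inside it; the multiplicative fusion below is an illustrative assumption, not necessarily BCNet's exact formula:

```python
import torch

def background_constrained_confidence(action_score: torch.Tensor,
                                      background_score: torch.Tensor) -> torch.Tensor:
    """Sketch of the plug-and-play background-constraint idea: suppress the
    confidence of proposals whose interior looks like background.
    action_score:     (N,) tIoU-supervised confidence of each proposal
    background_score: (N,) mean predicted background probability inside it
    The multiplicative fusion below is an illustrative assumption."""
    return action_score * (1.0 - background_score.clamp(0.0, 1.0))
```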
Temporal action proposal generation (TAPG) aims to estimate temporal intervals of actions in untrimmed videos, which is challenging yet plays an important role in many video analysis and understanding tasks. Despite great achievements in TAPG, most existing works ignore how humans perceive the interaction between agents and the surrounding environment, applying a deep learning model as a black box to untrimmed videos to extract visual representations. It is therefore beneficial, and potentially improves TAPG performance, to capture these interactions between agents and the environment. In this paper, we propose a novel framework named Agent-Aware Boundary Network (ABN), which consists of two sub-networks: (1) an Agent-Aware Representation Network to capture both agent-agent and agent-environment relationships in the video representation; and (2) a Boundary Generation Network to estimate the confidence scores of temporal intervals. In the Agent-Aware Representation Network, interactions between agents are expressed through a local pathway, which operates at a local level to focus on the motions of agents, whereas the overall perception of the surroundings is expressed through a global pathway, which operates at a global level to perceive the effects of agent-environment interactions. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks (i.e., C3D, SlowFast, and Two-Stream) show that our proposed ABN robustly outperforms state-of-the-art methods on TAPG regardless of the employed backbone network. We further examine proposal quality by feeding the proposals generated by our method into temporal action detection (TAD) frameworks and evaluating their detection performance.
It has been found that temporal action proposal generation, which aims to discover the temporal action instances within the range of the start and end frames in untrimmed videos, can largely benefit from proper temporal and semantic context exploitation. The latest efforts were dedicated to considering the temporal context and similarity-based semantic contexts through self-attention modules. However, they still suffer from cluttered background information and limited contextual feature learning. In this paper, we propose a novel Pyramid Region-based Slot Attention (PRSlot) module to address these issues. Instead of using similarity computation, our PRSlot module directly learns the local relations in an encoder-decoder manner and generates an enhanced representation of a local region, called a slot, based on attention over the input features. Specifically, given the input snippet-level features, the PRSlot module takes the target snippet as the query and its surrounding region as the key, and then generates slot representations for each query-key pair by aggregating the local snippet context with a parallel pyramid strategy. Based on PRSlot modules, we present a novel Pyramid Region-based Slot Attention Network, termed PRSA-Net, to learn a unified visual representation with rich temporal and semantic context for better proposal generation. Extensive experiments are conducted on the two widely adopted THUMOS14 and ActivityNet-1.3 benchmarks. Our PRSA-Net outperforms other state-of-the-art methods. In particular, we improve AR@100 from the previous best 50.67% to 56.12% for proposal generation and raise the mAP at 0.5 tIoU from 51.9% to 58.7% for action detection on THUMOS14. Code is available at https://github.com/handhand123/PRSA-Net
Temporal action proposal generation aims to localize temporal segments of human activities in videos. Current boundary-based proposal generation methods can generate proposals with precise boundary but often suffer from the inferior quality of confidence scores used for proposal retrieving. In this article, we propose an effective and end-to-end action proposal generation method, named ProposalVLAD, with Proposal-Intra Exploring Network (PVPI-Net). We first propose a ProposalVLAD module to dynamically generate global features of the entire video, then we combine the global features and proposal local features to generate the final feature representations for all candidate proposals. Then, we design a novel Proposal-Intra Loss function (PI-Loss) to generate more reliable proposal confidence scores. Extensive experiments on large-scale and challenging datasets demonstrate the effectiveness of our proposed method. Experimental results show that our PVPI-Net achieves significant improvements on two benchmark datasets (i.e., THUMOS’14 and ActivityNet-1.3) and sets new records for temporal action detection task.
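The abstract describes ProposalVLAD as dynamically aggregating global video features; a NetVLAD-style soft-assignment aggregation is one standard way to realize this, sketched below with an assumed cluster count (a stand-in, not the released module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NetVLAD1D(nn.Module):
    """NetVLAD-style aggregation over snippet features (an illustrative stand-in
    for the ProposalVLAD module; cluster count and layout are assumptions)."""

    def __init__(self, dim: int, clusters: int = 8):
        super().__init__()
        self.assign = nn.Conv1d(dim, clusters, kernel_size=1)
        self.centroids = nn.Parameter(torch.randn(clusters, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T) snippet features -> (B, K*C) global video descriptor
        B, C, T = x.shape
        a = F.softmax(self.assign(x), dim=1)                           # (B, K, T)
        residual = x.unsqueeze(1) - self.centroids.view(1, -1, C, 1)   # (B, K, C, T)
        vlad = (a.unsqueeze(2) * residual).sum(dim=-1)                 # (B, K, C)
        vlad = F.normalize(vlad, dim=-1).flatten(1)                    # intra-norm + flatten
        return F.normalize(vlad, dim=-1)
```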
No abstract available
Temporal action proposal generation is an essential and challenging task that aims at localizing temporal intervals containing human actions in untrimmed videos. Most existing approaches are unable to follow the human cognitive process of understanding video context, as they lack an attention mechanism to express the concept of an action, the agent who performs it, or the interaction between the agent and the environment. Based on the definition that an action occurs when a human, known as an agent, interacts with the environment and performs an action that affects it, we propose a contextual Agent-Environment Network (AEN). Our proposed contextual AEN involves (i) an agent pathway, operating at a local level to identify which humans/agents are acting, and (ii) an environment pathway, operating at a global level to describe how the agents interact with the environment. Comprehensive evaluations on the 20-action THUMOS-14 and 200-action ActivityNet-1.3 datasets with different backbone networks, i.e., C3D and SlowFast, show that our method robustly outperforms state-of-the-art methods regardless of the employed backbone network.
The goal of Temporal Action Localization (TAL) is to find the categories and temporal boundaries of actions in an untrimmed video. Most TAL methods rely heavily on action recognition models that are sensitive to action labels rather than temporal boundaries. More importantly, few works consider the background frames that are similar to action frames in pixels but dissimilar in semantics, which also leads to inaccurate temporal boundaries. To address the challenge above, we propose a Boundary-Aware Proposal Generation (BAPG) method with contrastive learning. Specifically, we define the above background frames as hard negative samples. Contrastive learning with hard negative mining is introduced to improve the discrimination of BAPG. BAPG is independent of the existing TAL network architecture, so it can be applied plug-and-play to mainstream TAL models without training. Extensive experimental results on THUMOS14 and ActivityNet-1.3 demonstrate that BAPG can significantly improve the performance of TAL.
Temporal action proposal generation aims to generate temporal video segments containing human actions in untrimmed videos, and is typically a preliminary step for video understanding tasks such as action localization and temporal description grounding. Fully-supervised solutions, though proven to be effective, suffer from heavy data annotation overhead. To address this problem, this paper focuses on a rarely investigated yet practical problem: semi-supervised learning for temporal action proposal generation. Firstly, we propose a Proposal Map oriented Mean-Teacher (PM-MT) model, which can use both labeled and unlabeled data for end-to-end model training. Secondly, a Suppression-and-Re-Generation (SRG) strategy is designed to generate high-quality pseudo labels for unlabeled data, which are then used to finetune the model. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art results on two public benchmark datasets for semi-supervised action proposal generation and outperforms fully-supervised learning methods with only a portion of the labeled data.
No abstract available
Temporal action detection (TAD) aims to detect the semantic labels and boundaries of action instances in untrimmed videos. Current mainstream approaches are multi-step solutions, which fall short in efficiency and flexibility. In this paper, we propose a unified network for TAD, termed Faster-TAD, by re-purposing a Faster-RCNN-like architecture. To tackle the unique difficulties of TAD, we make important improvements over the original framework: we propose a new Context-Adaptive Proposal Module and an innovative Fake-Proposal Generation Block, and we additionally use atomic action features to improve performance. Faster-TAD simplifies the TAD pipeline and achieves remarkable performance on many benchmarks, i.e., ActivityNet-1.3 (40.01% mAP), HACS Segments (38.39% mAP), and SoccerNet-Action Spotting (54.09% mAP). It outperforms existing single-network detectors by a large margin.
No abstract available
No abstract available
Temporal action detection is a fundamental yet challenging video understanding task. The calculation of confidence score for each generated action proposals remains the bottleneck of this task. Given that the continuity of videos is beneficial for self-supervised learning, in this paper we propose Self-attention Assisted Ranking Network (SARNet), which uses a self-attention mechanism to assist the ranking and retrieval of generated proposals. Our method incorporates a discriminative and a generative constraint to train the self-attention weight. Extensive experiments on THUMOS14 demonstrate that our method achieves a considerable improvement of average recall with a small number of proposals, and brings the mAP up to 30% at tIoU threshold 0.7 for the first time.
Temporal action proposal generation aims to generate temporal boundaries containing action instances. In real-time applications such as surveillance cameras, autonomous driving, and traffic monitoring, the online localization and recognition of human activities occurring in short temporal intervals are important areas of research. Existing approaches of temporal action proposal generation consider only the offline and frame-level feature aggregation along the temporal dimension. Those offline methods also generate many redundant irrelevant proposal regions in the frames as temporal boundaries. This leads to higher computational cost along with slow processing speed which is not suitable for online tasks. In this study, we propose a novel spatio-temporal attention network for online action proposal generation as opposed to existing offline proposal generation methods. Our novel proposed approach incorporates the inter-dependency between the spatial and temporal context information of each incoming video clip to generate more relevant online temporal action proposals. First, we propose a windowed spatial attention module to capture the inter-spatial relationship between the features of incoming frames. The windowed spatial network produces more robust clip-level feature representation and efficiently deals with noisy features such as occlusion or background scenes. Second, we introduce a temporal attention module to capture relevant temporal dynamic information mutually to the localized spatial information to model the long inter-frame temporal relationship since most online real life videos are untrimmed in nature. By applying these two attention modules sequentially, the novel proposed spatio-temporal network model is able to generate precise action boundaries at a particular instant of time. In addition, the model generates fewer discriminative temporal action proposals while maintaining a low computational cost and high processing speed suitable for online settings.
Temporal action proposal generation is an important and challenging task, aiming to localize the positions where an action or event may occur in an untrimmed video. In this paper, we propose an efficient and end-to-end framework for generating temporal action proposals, named the Phase-Sensitive Model (PSM), which fully understands all phases of temporal information. In particular, the PSM consists of two modules: Boundary Phase Classification (BPC) and Action Phase Classification (APC). The BPC provides two temporal boundary phase confidence maps from rich local information, while the APC is designed to generate an action phase confidence map from global features. Moreover, we introduce a new boundary probability calculation method to obtain the final score. Our experiments on ActivityNet-1.3 show a significant improvement with remarkable efficiency and generalizability.
Temporal action detection is a practical but challenging task. Current temporal action detection methods have major shortcomings in the accuracy of proposal generation; extracting long-range temporal features and fusing two-stream features for temporal action proposal generation remain challenges to be addressed. In this paper, we propose the Global Two-Stream Network, which innovatively introduces the Non-Local operation to extract global background information from the features of the generated candidate proposals. A two-stream backbone network is used to better exploit the two-stream features and generate temporal action proposal segments with precise boundaries and high confidence.
Temporal action detection is a fundamental yet challenging task in video content analysis. The performance of existing methods remains far from satisfactory, as the mAP drops dramatically at high tIoU thresholds. With the goal of predicting the starting and ending points more precisely, this work introduces the action category label into the temporal proposal generation stage of the training process. Specifically, with the category information, we propose two extra constraints, i.e., an action-based constraint and an action-class-agnostic constraint. The former minimizes the discrepancy within the same action category, while the latter forces the features of samples in the same phase to aggregate. Comprehensive experiments are conducted on the THUMOS'14 benchmark. A remarkable improvement in average recall is attained, especially when the number of proposals is small, and our approach achieves 29.0% mAP at a strict tIoU of 0.7.
No abstract available
Previous works have shown that explicit snippet relationship modeling can be helpful for feature learning on untrimmed action videos. However, snippet relationship learning in these methods is far from optimal, in that they fail to consider valuable temporally coarse-grained features, learnable soft relationship weights, and separate relationship learning in different temporal orders. To address this issue, we propose a novel SGC-Block for improved snippet relationship learning, which enables temporally coarse-to-fine, soft-valued, snippet-wise relationship learning in different temporal directions. The SGC-Block constructs a snippet graph and explicitly models (1) temporal relations (TPR), (2) coarse-grained snippet-wise relations (CSR), (3) fine-grained snippet-wise relations (FSR), and an additional (4) adaptive relations (ADR). In particular, the novel CSR is inspired by the feature pyramid pooling structure to obtain coarse feature representations in the temporal dimension. Experimental results show that our proposed approach outperforms most state-of-the-art methods on the THUMOS14 and ActivityNet-1.3 benchmarks.
No abstract available
Temporal action proposal generation is an important and challenging task in video understanding, which aims at detecting all temporal segments containing action instances of interest. Existing proposal generation approaches are generally based on pre-defined anchor windows or heuristic bottom-up boundary matching strategies. This paper presents a simple and efficient framework (RTD-Net) for direct action proposal generation, by re-purposing a Transformer-like architecture. To tackle the essential visual differences between time and space, we make three important improvements over the original transformer detection framework (DETR). First, to deal with the slowness prior in videos, we replace the original Transformer encoder with a boundary-attentive module to better capture long-range temporal information. Second, due to ambiguous temporal boundaries and relatively sparse annotations, we present a relaxed matching scheme to relieve the strict criterion of single assignment to each ground truth. Finally, we devise a three-branch head to further improve proposal confidence estimation by explicitly predicting its completeness. Extensive experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of RTD-Net on both temporal action proposal generation and temporal action detection. Moreover, due to its simplicity in design, our framework is more efficient than previous proposal generation methods, without non-maximum suppression post-processing. The code and models are made available at https://github.com/MCG-NJU/RTD-Action.
Temporal action proposal generation aims to estimate temporal intervals of actions in untrimmed videos, which is a challenging yet important task in the video understanding field. The proposals generated by current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval owing to the lack of efficient temporal modeling and effective boundary context utilization. In this paper, we propose Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals through "local and global" temporal context aggregation and complementary as well as progressive boundary refinement. Specifically, we first design a Local-Global Temporal Encoder (LGTE), which adopts the channel grouping strategy to efficiently encode both "local and global" temporal inter-dependencies. Furthermore, both the boundary and internal context of proposals are adopted for frame-level and segment-level boundary regressions, respectively. Temporal Boundary Regressor (TBR) is designed to combine these two regression granularities in an end-to-end fashion, which achieves the precise boundaries and reliable confidence of proposals through progressive refinement. Extensive experiments are conducted on three challenging datasets: HACS, ActivityNet-v1.3, and THUMOS-14, where TCANet can generate proposals with high precision and recall. By combining with the existing action classifier, TCANet can obtain remarkable temporal action detection performance compared with other methods. Not surprisingly, the proposed TCANet won the 1st place in the CVPR 2020 - HACS challenge leaderboard on temporal action localization task.
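The channel-grouping strategy of the LGTE can be pictured as routing part of the channels through a local operation and the rest through a global one; the 50/50 split, depthwise convolution, and self-attention below are assumptions used only for illustration:

```python
import torch
import torch.nn as nn

class LocalGlobalEncoderSketch(nn.Module):
    """Sketch of a channel-grouping local/global encoder: half of the channels
    go through a local depthwise convolution, the other half through global
    self-attention (the 50/50 split and layer choices are assumptions)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % 2 == 0          # assumed even channel count, half divisible by heads
        half = dim // 2
        self.local = nn.Conv1d(half, half, kernel_size=3, padding=1, groups=half)
        self.global_attn = nn.MultiheadAttention(half, heads, batch_first=True)
        self.out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T)
        x_local, x_global = x.chunk(2, dim=1)
        local = self.local(x_local)                     # short-term context
        g = x_global.transpose(1, 2)                    # (B, T, C/2)
        g, _ = self.global_attn(g, g, g)                # long-term context
        return self.out(torch.cat([local, g.transpose(1, 2)], dim=1))
```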
Self-supervised learning presents a remarkable performance to utilize unlabeled data for various video tasks. In this paper, we focus on applying the power of self-supervised methods to improve semi-supervised action proposal generation. Particularly, we design an effective Self-supervised Semi-supervised Temporal Action Proposal (SSTAP) framework. The SSTAP contains two crucial branches, i.e., temporal-aware semi-supervised branch and relation-aware self-supervised branch. The semi-supervised branch improves the proposal model by introducing two temporal perturbations, i.e., temporal feature shift and temporal feature flip, in the mean teacher framework. The self-supervised branch defines two pretext tasks, including masked feature reconstruction and clip-order prediction, to learn the relation of temporal clues. By this means, SSTAP can better explore unlabeled videos, and improve the discriminative abilities of learned action features. We extensively evaluate the proposed SSTAP on THUMOS14 and ActivityNet v1.3 datasets. The experimental results demonstrate that SSTAP significantly outperforms state-of-the-art semi-supervised methods and even matches fully-supervised methods. Code is available at https://github.com/wangxiang1230/SSTAP.
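The two temporal perturbations named in the abstract, temporal feature shift and temporal feature flip, are simple to sketch; the shift ratio below is an assumption:

```python
import torch

def temporal_feature_shift(x: torch.Tensor, ratio: float = 0.125) -> torch.Tensor:
    """Shift a fraction of feature channels by one step along time (a common
    reading of the 'temporal feature shift' perturbation; the ratio is an
    assumption). x: (B, C, T)."""
    out = x.clone()
    c = int(x.size(1) * ratio)
    out[:, :c, 1:] = x[:, :c, :-1]            # shift first block of channels forward
    out[:, c:2 * c, :-1] = x[:, c:2 * c, 1:]  # shift second block backward
    return out

def temporal_feature_flip(x: torch.Tensor) -> torch.Tensor:
    """Reverse the temporal order of the feature sequence. x: (B, C, T)."""
    return torch.flip(x, dims=[-1])
```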
Action proposal generation aims to locate the temporal boundaries of action segments in a video. In this paper, the LGCT network is proposed to address the problems of dense action segments, strong correlation between actions, and fuzzy boundaries in medical operation videos. LGCT is based on DETR, with two improvements: (1) to address the low proposal recall caused by weak contextual interactivity in the temporal dimension of video features, an LGBlock (Local and Global Block) is introduced at the encoding position of DETR to establish temporal context associations; (2) to address incomplete action proposals caused by blurred action boundaries, a completeness prediction head based on background context is proposed, which introduces the adjacent background information of proposals to predict completeness scores and stabilize the proposal generation pipeline. This paper conducts experimental exploration on the THUMOS14, ActivityNet1.3, and Medical-74 datasets. The entire model can be trained end-to-end, and the generated proposals do not require any post-processing operations. The test metric AR@500 for proposals reaches 62.29% and 75.31% on THUMOS14 and Medical-74, respectively, and AR@1 reaches 33.13% on ActivityNet1.3. Meanwhile, after introducing post-processing operations, AR@500 reaches 62.96% and 75.40% on THUMOS14 and Medical-74, and AR@1 reaches 33.21% on ActivityNet1.3.
Temporal action proposal generation aims at localizing the temporal segments containing human actions in a video. This work proposes a centerness-aware network (CAN), which is a novel one-stage approach intended to generate action proposals as keypoint triplets. A keypoint triplet contains two boundary points (starting and ending) and one center point. Specifically, we evaluate the probabilities of each temporal location in the video whether it is at the boundaries or the center region of ground truth action proposals. CAN optimizes the predicted boundary points interactively in a bidirectional adaptation form by exploiting the dependencies among them. Furthermore, to accurately locate the center points of action proposals with different time spans, temporal feature pyramids are utilized to incorporate multi-scale information explicitly. Using the generated three keypoints, CAN efficiently retrieves temporal proposals by grouping keypoints into triplets if they are geometrically aligned. Experiments show that CAN achieves the state-of-the-art performance on the public THUMOS-14 and ActivityNet-1.3 datasets. Moreover, further experiments demonstrate that by applying action classifiers on proposals generated by CAN, our method achieves the state-of-the-art performance in temporal action localization.
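Grouping (start, end, center) keypoints into proposals "if they are geometrically aligned" can be sketched as a simple alignment check; the tolerance threshold and scoring rule below are assumptions, not CAN's actual grouping procedure:

```python
def group_keypoint_triplets(starts, ends, centers, tol=0.15):
    """Illustrative sketch of grouping (start, end, center) keypoints into
    proposals: a triplet is kept if a detected center lies close enough to the
    midpoint of the start/end pair. `tol` (a fraction of the proposal length)
    and the product score are assumed values, not the paper's.
    starts, ends, centers: lists of (time, score) tuples."""
    proposals = []
    for s, s_score in starts:
        for e, e_score in ends:
            if e <= s:
                continue
            midpoint, length = (s + e) / 2.0, e - s
            for c, c_score in centers:
                if abs(c - midpoint) <= tol * length:   # geometric alignment check
                    proposals.append((s, e, s_score * e_score * c_score))
                    break
    return proposals
```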
Temporal action proposal generation is an essential and challenging task in video understanding, which aims to locate the temporal intervals that likely contain the actions of interest. Although great progress has been made, the problem is still far from being well solved. In particular, prevalent methods can handle well only the local dependencies (i.e., short-term dependencies) among adjacent frames but are generally powerless in dealing with the global dependencies (i.e., long-term dependencies) between distant frames. To tackle this issue, we propose CLGNet, a novel Collaborative Local-Global Learning Network for temporal action proposal. The majority of CLGNet is an integration of Temporal Convolution Network and Bidirectional Long Short-Term Memory, in which Temporal Convolution Network is responsible for local dependencies while Bidirectional Long Short-Term Memory takes charge of handling the global dependencies. Furthermore, an attention mechanism called the background suppression module is designed to guide our model to focus more on the actions. Extensive experiments on two benchmark datasets, THUMOS’14 and ActivityNet-1.3, show that the proposed method can outperform state-of-the-art methods, demonstrating the strong capability of modeling the actions with varying temporal durations.
Temporal action detection, a critical task in video activity understanding, is typically divided into two stages: proposal generation and classification. However, most existing methods overlook the importance of information transfer among proposals during classification, often treating each proposal in isolation, which hampers accurate label prediction. In this article, we propose a novel method for inferring semantic relationships both within and between action proposals, guiding the fusion of action proposal features accordingly. Building on this approach, we introduce the Proposal Semantic Relationship Graph Network (PSRGN), an end-to-end model that leverages intra-proposal semantic relationship graphs to extract cross-scale temporal context and an inter-proposal semantic relationship graph to incorporate complementary neighboring information, significantly improving proposal feature quality and overall detection performance. This is the first method to apply graph structure learning in temporal action detection, adaptively constructing the inter-proposal semantic graph. Extensive experiments on two datasets demonstrate the effectiveness of our approach, achieving state-of-the-art (SOTA) performance. Code and results are available at http://github.com/Riiick2011/PSRGN.
Temporal action proposal generation in an untrimmed video is very challenging, and comprehensive context exploration is critically important to generate accurate candidates of action instances. This paper proposes a Temporal-aware Attention Network (TAN) that localizes context-rich proposals by enhancing the temporal representations of boundaries and proposals. Firstly, we pinpoint that obtaining precise location information of action instances needs to consider long-distance temporal contexts. To this end, we propose a Global-Aware Attention (GAA) module for boundary-level interaction. Specifically, we introduce two novel gating mechanisms into the top-down interaction structure to incorporate multi-level semantics into video features effectively. Secondly, we design an efficient task-specific Adaptive Temporal Interaction (ATI) module to learn proposal associations. TAN enhances proposal-level contextual representations in a wide range by utilizing multi-scale interaction modules. Extensive experiments on the ActivityNet-1.3 and THUMOS-14 demonstrate the effectiveness of our proposed method, e.g., TAN achieves 73.43% in AR@1000 on THUMOS-14 and 69.01% in AUC on ActivityNet-1.3. Moreover, TAN significantly improves temporal action detection performance when equipped with existing action classification frameworks.
No abstract available
Weakly-supervised temporal action localization aims to detect temporal intervals of actions in arbitrarily long untrimmed videos with only video-level annotations. Owing to label sparsity, learning action consistency is intractable. In this paper, we assume that frames with similar representations in a given video should be considered as the same action. To this end, we develop a query-based contrastive learning paradigm to ensure action-semantic consistency. This mechanism encourages normalized embeddings with the same class to be pulled closer together, while embeddings from different classes are repelled apart. Besides, we design a two-branch framework, consisting of a class-aware branch and a class-agnostic branch, to learn salient features and fine-grained clues respectively. To further guarantee the action-semantic consistency of the two branches, unlike previous methods that handle each branch independently, we model the relationship between the two branches to avoid unreasonable predictions. Finally, the proposed model demonstrates superior performance over existing methods on the publicly available THUMOS-14 and ActivityNet-1.3 datasets. Substantial experiments and ablation studies also demonstrate the effectiveness of our model.
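The query-based contrastive mechanism pulls normalized embeddings of the same class together and repels those of different classes; a supervised-contrastive-style loss is a standard stand-in for this behaviour (not necessarily the paper's exact formulation):

```python
import torch
import torch.nn.functional as F

def class_contrastive_loss(embeddings: torch.Tensor, labels: torch.Tensor,
                           temperature: float = 0.1) -> torch.Tensor:
    """Supervised-contrastive-style sketch: snippets sharing a (pseudo) action
    label attract each other, different labels repel. A standard stand-in, not
    necessarily the paper's exact loss.
    embeddings: (N, D) snippet embeddings; labels: (N,) class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                               # (N, N) similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(self_mask, float("-inf")), dim=1, keepdim=True)
    pos_count = pos_mask.sum(dim=1).clamp_min(1)
    loss = -(log_prob * pos_mask).sum(dim=1) / pos_count        # mean over positives
    return loss.mean()
```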
Point-level weakly-supervised temporal action localization (P-WSTAL) aims to localize temporal extents of action instances and identify the corresponding categories with only a single point label for each action instance for training. Due to the sparse frame-level annotations, most existing models are in the localization-by-classification pipeline. However, there exist two major issues in this pipeline: large intra-action variation due to the task gap between classification and localization, and noisy classification learning caused by unreliable pseudo training samples. In this paper, we propose a novel framework, CRRC-Net, which introduces a co-supervised feature learning module and a probabilistic pseudo label mining module to simultaneously address the above two issues. Specifically, the co-supervised feature learning module is applied to exploit the complementary information in different modalities for learning more compact feature representations. Furthermore, the probabilistic pseudo label mining module utilizes the feature distances from action prototypes to estimate the likelihood of pseudo samples and rectify their corresponding labels for more reliable classification learning. Comprehensive experiments are conducted on different benchmarks, and the experimental results show that our method performs favorably against state-of-the-art approaches.
Weakly-supervised temporal action localization focuses on locating action intervals when merely video-level supervised signals are available. Conventional methods mostly rely on the attention framework, which generates a set of scores indicating the confidence that the video snippet belongs to the foreground, the background, and the context, respectively. However, such methods fail to consider the structural properties of snippet-level features when generating attention scores, and these structural properties are critical for capturing contextual information in temporal tasks. To this end, we propose a hierarchical attention generation mechanism with multi-scale fusion strategies to model such structural information. Besides, to resolve action-context confusion issues that are quite intractable in weakly-supervised action localization tasks, metric learning is further introduced into our framework to suppress context features from approaching action features, while encouraging them to be close to background features. Finally, our model is evaluated on THUMOS14 and ActivityNet1.3 benchmarks, and the results demonstrate that the proposed approach achieves desirable performance.
Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning (MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. The MIL-based methods are relatively well studied, with convincing performance achieved on classification but not on localization. Generally, they locate temporal regions via video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components: the first is an unsupervised change-points detection module that detects change-points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change-points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods, and even achieves comparable performance with fully-supervised methods.
Weakly supervised action localization is a challenging problem in video understanding and action recognition. Existing models usually formulate the training process as direct classification using video-level supervision. They tend to only locate the most discriminative parts of action instances and produce temporally incomplete detection results. A natural solution for this problem, the adversarial erasing strategy, is to remove such parts from training so that models can attend to complementary parts. Previous works do it in an offline and heuristic way. They adopt a multi-stage pipeline, where discriminative regions are determined and erased under the guidance of detection results from last stage. Such a pipeline can be both ineffective and inefficient, possibly hindering the overall performance. On the contrary, we combine adversarial erasing with dropout mechanism and propose a Temporal Dropout Module that learns where to remove in a data-driven and online manner. This plug-and-play module is trained without iterative stages, which not only simplifies the pipeline but also makes the regularization during training easier and more adaptive. Experiments show that the proposed method outperforms previous erasing-based methods by a large margin. More importantly, it achieves universal improvement when plugged into various direct classification methods and obtains state-of-the-art performance.
Weakly-supervised temporal action localization (WTAL) intends to detect action instances with only weak supervision, e.g., video-level labels. The current de facto pipeline locates action instances by thresholding and grouping continuous high-score regions on temporal class activation sequences. In this route, the capacity of the model to recognize the relationships between adjacent snippets is of vital importance, which determines the quality of the action boundaries. However, it is error-prone since the variations between adjacent snippets are typically subtle, and unfortunately this is overlooked in the literature. To tackle the issue, we propose a novel WTAL approach named Convex Combination Consistency between Neighbors (C3BN). C3BN consists of two key ingredients: a micro data augmentation strategy that increases the diversity in-between adjacent snippets by convex combination of adjacent snippets, and a macro-micro consistency regularization that enforces the model to be invariant to the transformations w.r.t. video semantics, snippet predictions, and snippet representations. Consequently, fine-grained patterns in-between adjacent snippets are enforced to be explored, thereby resulting in a more robust action boundary localization. Experimental results demonstrate the effectiveness of C3BN on top of various baselines for WTAL with video-level and point-level supervisions. Code is at https://github.com/Qinying-Liu/C3BN.
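The micro data augmentation in C3BN, a convex combination of adjacent snippets plus a consistency regularizer, can be sketched roughly as follows. This is an illustrative reading of the abstract rather than the released code; the mixing range, the KL form of the consistency term, and the toy linear classifier are assumptions.

```python
# Minimal sketch: mix adjacent snippets and enforce prediction consistency on the mixes.
import torch
import torch.nn.functional as F

def mix_adjacent_snippets(feats, alpha=0.5):
    """feats: (T, D) snippet features. Returns (T-1, D) convex combinations of temporal
    neighbours and the mixing coefficients used."""
    lam = torch.rand(feats.size(0) - 1, 1) * alpha      # random mixing ratios in [0, alpha)
    mixed = lam * feats[:-1] + (1.0 - lam) * feats[1:]
    return mixed, lam

def micro_consistency_loss(model, feats):
    """Predictions on mixed snippets should match the same convex combination of the
    predictions on the original snippets."""
    mixed, lam = mix_adjacent_snippets(feats)
    with torch.no_grad():
        p = model(feats).softmax(dim=-1)                 # (T, C) snippet-level predictions
    target = lam * p[:-1] + (1.0 - lam) * p[1:]          # interpolated pseudo targets
    pred = model(mixed).log_softmax(dim=-1)
    return F.kl_div(pred, target, reduction='batchmean')

# Toy usage with a hypothetical linear snippet classifier
model = torch.nn.Linear(32, 5)
snippets = torch.randn(10, 32)
print(micro_consistency_loss(model, snippets))
```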
Point-level weakly-supervised temporal action localization aims to accurately recognize and localize action segments in untrimmed videos, using only point-level annotations during training. Current methods primarily focus on mining sparse pseudo-labels and generating dense pseudo-labels. However, due to the sparsity of point-level labels and the impact of scene information on action representations, the reliability of dense pseudo-label methods still remains an issue. In this paper, we propose a point-level weakly-supervised temporal action localization method based on local representation enhancement and global temporal optimization. This method comprises two modules that enhance the representation capacity of action features and improve the reliability of class activation sequence classification, thereby enhancing the reliability of dense pseudo-labels and strengthening the model's capability for completeness learning. Specifically, we first generate representative features of actions using pseudo-label features, and calculate weights based on the feature similarity between these representative features and segment features to adjust the class activation sequence. Additionally, we maintain fixed-length queues for annotated segments and design an inter-video action contrastive learning framework. The experimental results demonstrate that our modules indeed enhance the model's capability for completeness learning, particularly achieving state-of-the-art results at high IoU thresholds.
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
No abstract available
Weakly supervised temporal action localization is a challenging vision task due to the absence of ground-truth temporal locations of actions in the training videos. With only video-level supervision during training, most existing methods rely on a Multiple Instance Learning (MIL) framework to predict the start and end frame of each action category in a video. However, the existing MIL-based approach has a major limitation of only capturing the most discriminative frames of an action, ignoring the full extent of an activity. Moreover, these methods cannot model background activity effectively, which plays an important role in localizing foreground activities. In this paper, we present a novel framework named HAM-Net with a hybrid attention mechanism which includes temporal soft, semi-soft and hard attentions to address these issues. Our temporal soft attention module, guided by an auxiliary background class in the classification module, models the background activity by introducing an "action-ness" score for each video snippet. Moreover, our temporal semi-soft and hard attention modules, calculating two attention scores for each video snippet, help to focus on the less discriminative frames of an action to capture the full action boundary. Our proposed approach outperforms recent state-of-the-art methods by at least 2.2% mAP at IoU threshold 0.5 on the THUMOS14 dataset, and by at least 1.3% mAP at IoU threshold 0.75 on the ActivityNet1.2 dataset.
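One plausible reading of the semi-soft and hard attention idea is sketched below: starting from the soft "action-ness" scores, the most discriminative snippets are suppressed so that the classifier must also rely on less discriminative parts of an action. The threshold and the exact suppression rules here are assumptions, not the HAM-Net definitions.

```python
# Rough, hypothetical sketch of soft / semi-soft / hard temporal attentions.
import torch

def hybrid_attentions(actionness, drop_thresh=0.8):
    """actionness: (T,) soft attention scores in [0, 1]."""
    soft = actionness
    discriminative = actionness > drop_thresh          # most confident snippets
    semi_soft = soft.masked_fill(discriminative, 0.0)  # keep soft values, drop the peaks
    hard = (~discriminative).float()                   # binary mask over remaining snippets
    return soft, semi_soft, hard

scores = torch.sigmoid(torch.randn(12))
for name, att in zip(("soft", "semi-soft", "hard"), hybrid_attentions(scores)):
    print(name, att)
```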
With the explosive growth of videos, weakly-supervised temporal action localization (WS-TAL) task has become a promising research direction in pattern analysis and machine learning. WS-TAL aims to detect and localize action instances with only video-level labels during training. Modern approaches have achieved impressive progress via powerful deep neural networks. However, robust and reliable WS-TAL remains challenging and underexplored due to considerable uncertainty caused by weak supervision, noisy evaluation environment, and unknown categories in the open world. To this end, we propose a new paradigm, named vectorized evidential learning (VEL), to explore local-to-global evidence collection for facilitating model performance. Specifically, a series of learnable meta-action units (MAUs) are automatically constructed, which serve as fundamental elements constituting diverse action categories. Since the same meta-action unit can manifest as distinct action components within different action categories, we leverage MAUs and category representations to dynamically and adaptively learn action components and action-component relations. After performing uncertainty estimation at both category-level and unit-level, the local evidence from action components is accumulated and optimized under the Subject Logic theory. Extensive experiments on the regular, noisy, and open-set settings of three popular benchmarks show that VEL consistently obtains more robust and reliable action localization performance than state-of-the-arts.
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in a given video with video-level categorical supervision. Previous works use the appearance and motion features extracted from a pre-trained feature encoder directly, e.g., via feature concatenation or score-level fusion. In this work, we argue that the features extracted from pre-trained extractors, e.g., I3D, which are trained for trimmed-video action classification but are not specific to the WS-TAL task, lead to inevitable redundancy and sub-optimal performance. Therefore, feature re-calibration is needed to reduce this task-irrelevant redundancy. Here, we propose a cross-modal consensus network (CO2-Net) to tackle this problem. In CO2-Net, we mainly introduce two identical cross-modal consensus modules (CCM) that use a cross-modal attention mechanism to filter out task-irrelevant redundancy using the global information from the main modality and the cross-modal local information from the auxiliary modality. Moreover, we further explore inter-modality consistency, where we treat the attention weights derived from each CCM as the pseudo targets of the attention weights derived from the other CCM to maintain consistency between the predictions of the two CCMs, forming a mutual learning manner. Finally, we conduct extensive experiments on two commonly used temporal action localization datasets, THUMOS14 and ActivityNet1.2, to verify our method, on which we achieve state-of-the-art results. The experimental results show that our proposed cross-modal consensus module can produce more representative features for temporal action localization.
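To make the cross-modal re-calibration idea concrete, here is a hedged sketch of a channel-gating module in that spirit: global context from the main modality and local features from the auxiliary modality jointly produce an attention map that filters the main modality. The layer layout and names are assumptions and differ from the actual CCM design.

```python
# Illustrative cross-modal gating module (assumed architecture details).
import torch
import torch.nn as nn

class CrossModalGate(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Sequential(nn.Conv1d(2 * dim, dim, kernel_size=1),
                                  nn.ReLU(),
                                  nn.Conv1d(dim, dim, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, main, aux):
        """main, aux: (B, D, T) e.g. RGB / flow snippet features."""
        global_ctx = main.mean(dim=2, keepdim=True).expand_as(main)  # global info of main modality
        gate = self.fuse(torch.cat([global_ctx, aux], dim=1))        # (B, D, T) channel attention
        return main * gate                                           # filtered main-modality features

rgb, flow = torch.randn(2, 64, 100), torch.randn(2, 64, 100)
print(CrossModalGate(64)(rgb, flow).shape)
```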
No abstract available
No abstract available
No abstract available
Weakly-supervised temporal action localization (WTAL) is the problem of learning an action localization model with only video-level labels available. In recent years, many WTAL methods have been developed. However, hard-to-predict snippets near action boundaries are often not considered in these existing approaches, causing action incompleteness and over-completeness issues. To solve these issues, in this work, an end-to-end snippets relation and hard-snippets mask network (SRHN) is proposed. Specifically, a hard-snippets mask module is applied to mask the hard-to-predict snippets adaptively, so that the trained model focuses more on those snippets with low uncertainty. Then, a snippets relation module is designed to capture the relationship among snippets and make hard-to-predict snippets easier to predict by aggregating information over multiple temporal receptive fields. Finally, a snippet enhancement loss is developed to reduce, for hard-to-predict and other snippets, the probabilities of action classes that are not present in the video, while enlarging the probabilities of action classes that are present. Extensive experiments on the THUMOS14, ActivityNet1.2, and ActivityNet1.3 datasets demonstrate the effectiveness of the SRHN method.
In this study, we propose a single-stage model for video action detection and a real-world action detection dataset POWER collected from real power operation scenarios. While previous studies have made significant progress in overall classification and localization performance, they often struggle with the actions that have short duration, hindering the application of these approaches. To address this, we introduce the Cross-scale Selective Context Aggregation Network (CSCAN), which focuses on improving the detection of short actions. This network integrates three key components: 1) a cross-scale feature conduction structure combined with a tailored alignment mechanism; 2) a selective context aggregation module based on gating mechanism; and 3) an effective scale-invariant consistency training strategy to enable the model to learn scale-invariant action representation. We evaluated our method on the self-collected dataset POWER and on the most widely used action detection benchmarks THUMOS14 and ActivityNet v1.3. The extensive results show that our model outperforms other approaches, especially in detecting real-world short actions, demonstrating the effectiveness of our approach.
This paper proposes a novel multi-modal transformer network for detecting actions in untrimmed videos. To enrich the action features, our transformer network utilizes a new multi-modal attention mechanism that computes the correlations between different spatial and motion modalities combinations. Exploring such correlations for actions has not been attempted previously. To use the motion and spatial modality more effectively, we suggest an algorithm that corrects the motion distortion caused by camera movement. Such motion distortion, common in untrimmed videos, severely reduces the expressive power of motion features such as optical flow fields. Our proposed algorithm outperforms the state-of-the-art methods on two public benchmarks, THUMOS14 and ActivityNet. We also conducted comparative experiments on our new instructional activity dataset, including a large set of challenging classroom videos captured from elementary schools.
Temporal Action Detection (TAD) is a challenging task in video understanding. Current methods mainly use global features for boundary matching or predefine all possible proposals, while ignoring long context information and local action boundary features, resulting in reduced detection accuracy. To fill this gap, we propose a Dilation Location Network (DL-Net) model to generate more precise action boundaries by enhancing the boundary features of actions and aggregating long contextual information. Specifically, we design a boundary feature enhancement (BFE) block, which strengthens action boundary features and fuses similar features across channels via pooling and channel squeezing. Meanwhile, for action localization, we design multiple dilated convolutional structures to aggregate long contextual information of time points/intervals. Extensive experiments on ActivityNet-1.3 and THUMOS14 show that DL-Net is capable of enhancing action boundary features and aggregating long contextual information effectively.
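The general mechanism the DL-Net abstract refers to, aggregating long temporal context with dilated convolutions, can be sketched with a small stack of 1-D dilated layers; the layer count and dilation rates below are arbitrary assumptions rather than the paper's configuration.

```python
# Minimal sketch: stacked dilated 1-D convolutions with residual connections.
import torch
import torch.nn as nn

class DilatedContext(nn.Module):
    def __init__(self, dim, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d) for d in dilations])

    def forward(self, x):
        """x: (B, D, T) snippet features; receptive field grows with each dilated layer."""
        for conv in self.layers:
            x = torch.relu(conv(x)) + x      # residual connection keeps local detail
        return x

feats = torch.randn(1, 32, 128)
print(DilatedContext(32)(feats).shape)       # (1, 32, 128): unchanged length, larger context
```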
Existing action detection approaches do not take the spatio-temporal structural relationships of action clips into account, which limits their applicability in real-world scenarios, even though such relationships can benefit detection if exploited. To this end, this paper formulates the action detection problem as a reinforcement learning process which is rewarded by observing both the clip sampling and classification results and adjusting the detection scheme accordingly. In particular, our framework consists of a heterogeneous graph convolutional network to represent the spatio-temporal features capturing their inherent relations, a policy network which determines the probabilities over a predefined action sampling space, and a classification network for action clip recognition. We accomplish joint network learning by considering the temporal intersection over union and the Euclidean distance between detected clips and the ground truth. Experiments on ActivityNet v1.3 and THUMOS14 demonstrate the effectiveness of our method.
Temporal action detection aims to correctly predict the categories and temporal intervals of actions in an untrimmed video by using only video-level labels, which is a basic but challenging task in video understanding. Inspired by Sparse R-CNN for object detection, we present a purely sparse method for temporal action detection. In our method, a fixed sparse set of N learnable temporal proposals (e.g., 50) is provided to a dynamic action interaction head to perform classification and localization. This sparse temporal action detection method completely avoids all efforts related to temporal candidate design and many-to-one label assignment. More importantly, final predictions are output directly without a non-maximum suppression post-processing step. Extensive experiments show that our method achieves state-of-the-art performance for both action proposal generation and localization on the THUMOS14 detection benchmark and competitive performance on the ActivityNet-1.3 challenge.
Current one-stage action detection methods, which simultaneously predict action boundaries and the corresponding class, do not estimate or use a measure of confidence in their boundary predictions, which can lead to inaccurate boundaries. We incorporate the estimation of boundary confidence into one-stage anchor-free detection, through an additional prediction head that predicts the refined boundaries with higher confidence. We obtain state-of-the-art performance on the challenging EPIC-KITCHENS-100 action detection benchmark as well as the standard THUMOS14 action detection benchmark, and achieve improvement on the ActivityNet-1.3 benchmark.
Proposal generation is a fundamental yet challenging task for two-stage temporal action detection pipelines. The task aims at predicting starting and ending boundaries of segments in realistic video sequences and action recognition methods cannot be directly applied to such videos due to their untrimmed nature. Most state-of-the-art models rely on temporal convolutional neural networks with pre-defined anchor segments. By eliminating anchors, we propose a lighter end-to-end trainable Anchor-Free Multiscale Transformer-based Generator (AMTG) model using local clues via video snippets. To improve effectiveness for temporal evaluation, we apply multiscale Transformer encoders to sequences with a bi-directional mask extension that simultaneously predicts boundary distances with uncertainties and various snippet-based local scores. Later, our model integrates local predictions to generate proposal candidates using the proposed scoring function. Experiments on the THUMOS14 and ActivityNet-1.3 benchmarks demonstrate the effectiveness of AMTG for the temporal proposal generation task.
Self-attention based Transformer models have demonstrated impressive results for image classification and object detection, and more recently for video understanding. Inspired by this success, we investigate the application of Transformer networks for temporal action localization in videos. To this end, we present ActionFormer -- a simple yet powerful model to identify actions in time and recognize their categories in a single shot, without using action proposals or relying on pre-defined anchor windows. ActionFormer combines a multiscale feature representation with local self-attention, and uses a light-weighted decoder to classify every moment in time and estimate the corresponding action boundaries. We show that this orchestrated design results in major improvements upon prior works. Without bells and whistles, ActionFormer achieves 71.0% mAP at tIoU=0.5 on THUMOS14, outperforming the best prior model by 14.1 absolute percentage points. Further, ActionFormer demonstrates strong results on ActivityNet 1.3 (36.6% average mAP) and EPIC-Kitchens 100 (+13.5% average mAP over prior works). Our code is available at http://github.com/happyharrycn/actionformer_release.
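The anchor-free, proposal-free decoding that ActionFormer-style models use can be illustrated with a small per-moment head: each time step receives class scores plus regressed distances to the action start and end, from which candidate segments are decoded directly. The sketch below shows only this generic decoding idea under assumed shapes; it is not the released ActionFormer code.

```python
# Hedged sketch of an anchor-free per-moment classification + boundary regression head.
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls = nn.Conv1d(dim, num_classes, kernel_size=3, padding=1)
        self.reg = nn.Conv1d(dim, 2, kernel_size=3, padding=1)   # distances to start / end

    def forward(self, feats):
        """feats: (B, D, T). Returns per-moment class scores and decoded segments."""
        scores = self.cls(feats).sigmoid()                        # (B, C, T)
        dist = torch.relu(self.reg(feats))                        # non-negative offsets, (B, 2, T)
        t = torch.arange(feats.size(2), device=feats.device).float()
        starts, ends = t - dist[:, 0], t + dist[:, 1]             # (B, T) segment boundaries
        return scores, torch.stack([starts, ends], dim=1)

head = AnchorFreeHead(dim=64, num_classes=20)
scores, segments = head(torch.randn(2, 64, 256))
print(scores.shape, segments.shape)   # (2, 20, 256), (2, 2, 256)
```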
Detecting actions as they occur is essential for applications like video surveillance, autonomous driving, and human-robot interaction. Known as online action detection, this task requires classifying actions in streaming videos, handling background noise, and coping with incomplete actions. Transformer architectures are the current state-of-the-art, yet the potential of recent advancements in computer vision, particularly vision-language models (VLMs), remains largely untapped for this problem, partly due to high computational costs. In this paper, we introduce TOAD: A Text-driven Online Action Detection architecture that supports zero-shot and few-shot learning. TOAD leverages CLIP (Contrastive Language-Image Pretraining) textual embeddings, enabling efficient use of VLMs without significant computational overhead. Our model achieves 82.46% mAP on the THUMOS14 dataset, outperforming existing methods, and sets new baselines for zero-shot and few-shot performance on the THUMOS14 and TVSeries datasets.
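At its core, text-driven zero-shot classification of this kind scores each frame feature against precomputed text embeddings of the class names (e.g., produced offline by a CLIP text encoder). A minimal sketch with assumed shapes and a made-up logit scale is given below; TOAD's online detection architecture adds much more on top of this.

```python
# Small sketch: per-frame action scores as scaled cosine similarity to text embeddings.
import torch
import torch.nn.functional as F

def text_driven_scores(frame_feats, class_text_embeds, scale=100.0):
    """frame_feats: (T, D) visual features; class_text_embeds: (C, D) text embeddings."""
    v = F.normalize(frame_feats, dim=-1)
    t = F.normalize(class_text_embeds, dim=-1)
    return (scale * v @ t.t()).softmax(dim=-1)     # (T, C) per-frame class probabilities

frames = torch.randn(30, 512)            # e.g. one feature per streamed frame
texts = torch.randn(21, 512)             # 20 action classes + a background prompt (toy values)
print(text_driven_scores(frames, texts).shape)
```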
We propose action-agnostic point-level (AAPL) supervision for temporal action detection to achieve accurate action instance detection with a lightly annotated dataset. In the proposed scheme, a small portion of video frames is sampled in an unsupervised manner and presented to human annotators, who then label the frames with action categories. Unlike point-level supervision, which requires annotators to search for every action instance in an untrimmed video, frames to annotate are selected without human intervention in AAPL supervision. We also propose a detection model and learning method to effectively utilize the AAPL labels. Extensive experiments on the variety of datasets (THUMOS'14, FineAction, GTEA, BEOID, and ActivityNet 1.3) demonstrate that the proposed approach is competitive with or outperforms prior methods for video-level and point-level supervision in terms of the trade-off between the annotation cost and detection performance.
Due to the variability of video length and action duration, the temporal action detection task faces the problem of blurred action boundaries that are difficult to capture accurately. To alleviate this problem, this paper proposes a Frequency Attention Mechanism (FAM) that adaptively models the frequency dependencies between video signal channels, enabling the model to better understand frequency variations in the video and handle the complexity of different action durations, thus enhancing the sensitivity and discriminative power at action boundaries and providing strong action recognition capability even in long video sequences. Through comprehensive experimental validation on a series of representative benchmark datasets (e.g., THUMOS14 and ActivityNet1.3), our approach demonstrates significant performance improvement.
Most online action detection methods focus on solving a (K + 1) classification problem, where the additional category represents the ‘background’ class. However, training on the ‘background’ class and managing data imbalance are common challenges in online action detection. To address these issues, we propose a framework for online action detection by incorporating an additional pathway between the feature extractor and online action detection model. Specifically, we present one configuration that retains feature distinctions for fusion with the final decision from the Long Short-Term Transformer (LSTR), enhancing its performance in the (K + 1) classification. Experimental results show that the proposed method achieves an accuracy of 71.2% in mean Average Precision (mAP) on the Thumos14 dataset, outperforming the 69.5% achieved by the original LSTR method.
False predictions often hamper human action recognition in videos, reducing the reliability of detection models. This paper presents a novel approach that integrates the Video Vision Transformer (ViViT) and YOLOv8 to minimize false alarms in action detection. YOLOv8 detects human subjects within video segments, while ViViT reclassifies these segments to reduce false positives. We validate our method on two benchmark datasets: THUMOS14 and EPIC-Kitchen. Our experiments substantially reduce false positives, improving model performance without sacrificing accuracy. Specifically, our framework reduces false predictions by 45.8%. This approach enhances the precision of action detection models, offering a more robust and reliable solution for practical applications such as video surveillance and human activity analysis in untrimmed videos.
In this paper, we explore the problem of Online Action Detection (OAD), where the task is to detect ongoing actions from streaming videos without access to video frames in the future. Existing methods achieve good detection performance by capturing long-range temporal structures. However, a major challenge of this task is to detect actions at a specific time that arrive with insufficient observations. In this work, we utilize the additional future frames available at the training phase and propose a novel Knowledge Distillation (KD) framework for OAD, where a teacher network looks at more frames from the future and the student network distills the knowledge from the teacher for detecting ongoing actions from the observation up to the current frames. Usually, the conventional KD regards a high-level teacher network (i.e., the network after the last training iteration) to guide the student network throughout all training iterations, which may result in poor distillation due to the large knowledge gap between the high-level teacher and the student network at early training iterations. To remedy this, we propose a novel progressive knowledge distillation from different levels of teachers (PKD-DLT) for OAD, where in addition to a high-level teacher, we also generate several low- and middle-level teachers, and progressively transfer the knowledge (in the order of low- to high-level) to the student network throughout training iterations, for effective distillation. Evaluated on two challenging datasets THUMOS14 and TVSeries, we validate that our PKD-DLT is an effective teacher-student learning paradigm, which can be a plug-in to improve the performance of the existing OAD models and achieve a state-of-the-art.
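The distillation mechanics above can be sketched as a standard soft-label KL loss between student and teacher predictions, with the supervising teacher switched progressively from low- to high-level checkpoints as training proceeds. The temperature, schedule, and toy tensors below are illustrative assumptions rather than the PKD-DLT recipe.

```python
# Sketch: soft-label knowledge distillation with a progressive teacher schedule.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, tau=2.0):
    """Standard soft-label knowledge distillation between per-frame action logits."""
    p_t = (teacher_logits / tau).softmax(dim=-1)
    log_p_s = (student_logits / tau).log_softmax(dim=-1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * tau * tau

def pick_teacher(teachers, epoch, total_epochs):
    """Progressively move from low-level to high-level teachers over training."""
    idx = min(len(teachers) - 1, epoch * len(teachers) // total_epochs)
    return teachers[idx]

# Toy usage with three hypothetical teacher outputs of increasing quality
teachers = [torch.randn(8, 10) for _ in range(3)]
student = torch.randn(8, 10, requires_grad=True)
print(distill_loss(student, pick_teacher(teachers, epoch=5, total_epochs=30)))
```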
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD in short. Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video. This presents a generative modeling perspective, against previous discriminative learning manners. This capability is achieved by first diffusing the ground-truth proposals to random ones (i.e., the forward/noising process) and then learning to reverse the noising process (i.e., the backward/denoising process). Concretely, we establish the denoising process in the Transformer decoder (e.g., DETR) by introducing a temporal location query design with faster convergence in training. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that our DiffTAD achieves top performance compared to previous art alternatives. The code is available at https://github.com/sauradip/DiffusionTAD.
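The forward (noising) half of such a diffusion formulation is easy to illustrate: ground-truth segments, e.g., represented as normalized (center, width) pairs, are corrupted with Gaussian noise under a schedule, and the network is trained to reverse the corruption. The sketch below shows only this forward step with an assumed cosine schedule; it is not DiffTAD's implementation.

```python
# Sketch of the forward (noising) step on ground-truth temporal segments.
import math
import torch

def noise_segments(gt_segments, step, num_steps=1000):
    """gt_segments: (N, 2) normalized (center, width) pairs; step: diffusion time step."""
    t = step / num_steps
    alpha_bar = math.cos(t * math.pi / 2) ** 2            # cosine schedule, in (0, 1]
    eps = torch.randn_like(gt_segments)
    noisy = math.sqrt(alpha_bar) * gt_segments + math.sqrt(1 - alpha_bar) * eps
    return noisy.clamp(0.0, 1.0), eps                      # corrupted proposals + noise target

gt = torch.tensor([[0.30, 0.10], [0.65, 0.20]])            # two ground-truth actions
print(noise_segments(gt, step=500))
```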
Online action detection plays a vital role in video action understanding and can be widely used in various video analysis applications. This task aims to detect actions at the current moment within long untrimmed video streams. However, accurately identifying action-background transitions that are ambiguous in terms of time during detection can be challenging due to the similarity between the action and background clips, adding to the difficulty in finding a suitable division between them. To address this issue, we propose a hard video clip mining method based on deep metric learning for online action detection named HCM. The HCM method first selects video clips that are hard to distinguish to determine the optimization objects. Then, a hard clip mining loss is adopted to push the features toward the centers of the categories to which they belong and away from others. Furthermore, we introduce an intra-class feature compaction loss to constrain the divergence of action features, ensuring the stability of their distribution. We evaluated the proposed method on two challenging online action detection datasets, THUMOS14 and TVSeries. The results show that HCM is effective and efficient in online action detection and action anticipation tasks.
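A center-based metric loss in the spirit of the hard clip mining objective above can be sketched as follows: clips whose features are not clearly closer to their own class center than to any other center are treated as hard and penalized. The margin value and the mining rule here are assumptions for illustration only.

```python
# Hedged sketch: hinge loss pulling hard clips toward their class center.
import torch
import torch.nn.functional as F

def hard_clip_center_loss(feats, labels, centers, margin=0.5):
    """feats: (N, D) clip features; labels: (N,); centers: (C, D) class centers."""
    d = torch.cdist(feats, centers)                        # (N, C) distances to every center
    pos = d.gather(1, labels.unsqueeze(1)).squeeze(1)      # distance to own class center
    d_other = d.scatter(1, labels.unsqueeze(1), float('inf'))
    neg = d_other.min(dim=1).values                        # distance to closest other center
    per_clip = F.relu(pos - neg + margin)                  # hinge: own center must be closer
    hard = per_clip > 0                                    # "hard" clips violate the margin
    return per_clip[hard].mean() if hard.any() else per_clip.sum()

feats = torch.randn(16, 32)
labels = torch.randint(0, 4, (16,))
centers = torch.randn(4, 32)
print(hard_clip_center_loss(feats, labels, centers))
```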
For the temporal action localization task on the ActivityNet-1.3 dataset, we aim to locate the temporal boundaries of each action and predict its class in untrimmed videos. We first apply VideoSwinTransformer as the feature extractor to extract different features. Then we apply a unified network following Faster-TAD to simultaneously obtain proposals and semantic labels. Last, we ensemble the results of different temporal action detection models, which complement each other. Faster-TAD simplifies the pipeline of TAD and achieves remarkable performance, obtaining results comparable to those of multi-step approaches.
Temporal Action Detection (TAD) is fundamental yet challenging for real-world video applications. Leveraging the unique benefits of transformers, various DETR-based approaches have been adopted in TAD. However, it has recently been identified that the attention collapse in self-attention causes the performance degradation of DETR for TAD. Building upon previous research, this paper newly addresses the attention collapse problem in cross-attention within DETR-based TAD methods. Moreover, our findings reveal that cross-attention exhibits patterns distinct from predictions, indicating a short-cut phenomenon. To resolve this, we propose a new framework, Prediction-Feedback DETR (Pred-DETR), which utilizes predictions to restore the collapse and align the cross- and self-attention with predictions. Specifically, we devise novel prediction-feedback objectives using guidance from the relations of the predictions. As a result, Pred-DETR significantly alleviates the collapse and achieves state-of-the-art performance among DETR-based methods on various challenging benchmarks including THUMOS14, ActivityNet-v1.3, HACS, and FineAction.
Human body actions are an important form of non-verbal communication in social interactions. This paper specifically focuses on a subset of body actions known as micro-actions, which are subtle, low-intensity body movements with promising applications in human emotion analysis. In real-world scenarios, human micro-actions often temporally co-occur, with multiple micro-actions overlapping in time, such as concurrent head and hand movements. However, current research primarily focuses on recognizing individual micro-actions while overlooking their co-occurring nature. To address this gap, we propose a new task named Multi-label Micro-Action Detection (MMAD), which involves identifying all micro-actions in a given short video, determining their start and end times, and categorizing them. Accomplishing this requires a model capable of accurately capturing both long-term and short-term action relationships to detect multiple overlapping micro-actions. To facilitate the MMAD task, we introduce a new dataset named Multi-label Micro-Action-52 (MMA-52) and propose a baseline method equipped with a dual-path spatial-temporal adapter to address the challenges of subtle visual change in MMAD. We hope that MMA-52 can stimulate research on micro-action analysis in videos and prompt the development of spatio-temporal modeling in human-centric video understanding. The proposed MMA-52 dataset is available at: https://github.com/VUT-HFUT/Micro-Action.
Existing temporal action detection (TAD) methods rely on generating an overwhelmingly large number of proposals per video. This leads to complex model designs due to proposal generation and/or per-proposal action instance evaluation and the resultant high computational cost. In this work, for the first time, we propose a proposal-free Temporal Action detection model with Global Segmentation mask (TAGS). Our core idea is to learn a global segmentation mask of each action instance jointly at the full video length. The TAGS model differs significantly from the conventional proposal-based methods by focusing on global temporal representation learning to directly detect local start and end points of action instances without proposals. Further, by modeling TAD holistically rather than locally at the individual proposal level, TAGS needs a much simpler model architecture with lower computational cost. Extensive experiments show that despite its simpler design, TAGS outperforms existing TAD methods, achieving new state-of-the-art performance on two benchmarks. Importantly, it is ~ 20x faster to train and ~1.6x more efficient for inference. Our PyTorch implementation of TAGS is available at https://github.com/sauradip/TAGS .
Temporal action detection (TAD) aims to localize the start and end frames of actions in untrimmed videos, which is a challenging task due to the similarity of adjacent frames and the ambiguity of action boundaries. Previous methods often generate coarse proposals first and then perform proposal-based refinement, which is coupled with prior action detectors and leads to proposal-oriented offsets. However, this paradigm increases the training difficulty of the TAD model and is heavily influenced by the quantity and quality of the proposals. To address the above issues, we decouple the refinement process from conventional TAD methods and propose a learnable, proposal-free refinement method for fine boundary localization, named RefineTAD. We first propose a multi-level refinement module to generate multi-scale boundary offsets, score offsets and boundary-aware probability at each time point based on the feature pyramid. Then, we propose an offset focusing strategy to progressively refine the predicted results of TAD models in a coarse-to-fine manner with our multi-scale offsets. We perform extensive experiments on three challenging datasets and demonstrate that our RefineTAD significantly improves the state-of-the-art TAD methods with minimal computational overhead.
The task of action detection aims at deducing both the action category and localization of the start and end moment for each action instance in a long, untrimmed video. While vision Transformers have driven the recent advances in video understanding, it is non-trivial to design an efficient architecture for action detection due to the prohibitively expensive self-attentions over a long sequence of video clips. To this end, we present an efficient hierarchical Spatio-Temporal Pyramid Transformer (STPT) for action detection, building upon the fact that the early self-attention layers in Transformers still focus on local patterns. Specifically, we propose to use local window attention to encode rich local spatio-temporal representations in the early stages while applying global attention modules to capture long-term space-time dependencies in the later stages. In this way, our STPT can encode both locality and dependency with largely reduced redundancy, delivering a promising trade-off between accuracy and efficiency. For example, with only RGB input, the proposed STPT achieves 53.6% mAP on THUMOS14, surpassing I3D+AFSD RGB model by over 10% and performing favorably against state-of-the-art AFSD that uses additional flow features with 31% fewer GFLOPs, which serves as an effective and efficient end-to-end Transformer-based framework for action detection.
No abstract available
Human activity recognition (HAR) based on skeleton data, which can be extracted from videos (e.g., with Kinect) or provided by a depth camera, is a time-series classification problem in which handling both spatial and temporal dependencies is crucial for good recognition. In online human activity recognition, identifying the beginning and end of an action is an important element and can be difficult in a continuous data stream. In this work, we present a 3D skeleton data encoding method to generate an image that preserves the spatial and temporal dependencies existing between the skeletal joints. To allow online action detection, we combine this encoding scheme with a sliding window over the continuous data stream. In this way, no start or stop timestamp is needed and recognition can be performed at any moment. A deep CNN is used to perform online action detection.
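The sliding-window mechanism on a continuous stream described above can be sketched in a few lines: frames are buffered, and a fixed-length window is emitted every `stride` frames for the CNN classifier, so detection can happen at any moment without start/stop timestamps. The window size and stride below are arbitrary assumptions.

```python
# Minimal sketch: sliding windows over a continuous stream of per-frame features.
from collections import deque

def sliding_windows(stream, window_size=32, stride=8):
    """stream: iterable of per-frame feature vectors. Yields fixed-length windows."""
    buf = deque(maxlen=window_size)
    for i, frame in enumerate(stream):
        buf.append(frame)
        if len(buf) == window_size and (i + 1 - window_size) % stride == 0:
            yield list(buf)                     # window ready for the classifier

frames = range(100)                             # stand-in for encoded skeleton frames
print(sum(1 for _ in sliding_windows(frames)))  # number of windows produced
```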
Temporal Action Detection (TAD) is a crucial but challenging task in video understanding. It aims at detecting both the type and the start and end frames of each action instance in a long, untrimmed video. Most current models adopt both RGB and optical-flow streams for the TAD task. Thus, original RGB frames must be converted manually into optical-flow frames with additional computation and time cost, which is an obstacle to achieving real-time processing. At present, many models adopt two-stage strategies, which slow down inference and require complicated tuning of proposal generation. By comparison, we propose a one-stage anchor-free temporal localization method with an RGB stream only, in which a novel Newtonian Mechanics-MLP architecture is established. It achieves accuracy comparable to all existing state-of-the-art models while surpassing their inference speed by a large margin: the typical inference speed reported in this paper is an astounding 4.44 videos per second on THUMOS14. In applications, because there is no need to compute optical flow, the inference speed will be even faster. This also shows that MLPs have great potential in downstream tasks such as TAD. The source code is available at https://github.com/BonedDeng/TadML
Temporal action detection aims to judge whether action instances exist in a long untrimmed video and to locate the start and end time of each action. Even though existing action detection methods have shown promising results in recent years with the widespread application of Convolutional Neural Networks (CNNs), it is still a challenging problem to accurately locate each action segment while ensuring real-time performance. In order to achieve a good tradeoff between detection efficiency and accuracy, we present a coarse-to-fine hierarchical temporal action detection method using a multi-scale sliding window mechanism. Since the complexity of the convolution operator is proportional to the number and size of the input video clips, the idea of our proposed method is to first determine candidate action proposals and then perform the detection task only on these candidates, with a view to reducing the overall complexity of the detection method. By making full use of the spatio-temporal information of video clips, a lightweight 3D-CNN classifier is first used to quickly determine whether a video clip is a candidate action proposal, avoiding the re-detection of a large number of non-action video clips by the heavyweight deep network. A heavyweight detector is designed to further improve the accuracy of action localization by considering both boundary regression loss and category loss in the target loss function. In addition, Non-Maximum Suppression (NMS) is performed to eliminate redundant detection results among overlapping proposals. The mean Average Precision (mAP) is 40.6%, 51.7% and 20.4% on the THUMOS14, ActivityNet and MPII Cooking datasets, respectively, when the temporal Intersection-over-Union (tIoU) threshold is set to 0.5. Experimental results show the superior performance of the proposed method on three challenging temporal action detection datasets while achieving real-time speed. At the same time, our method can generate proposals for unseen action classes with high recall.
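The final NMS step mentioned above is the standard temporal variant of non-maximum suppression; a plain-Python sketch of the generic algorithm (not the paper's exact code) is given here for reference.

```python
# Generic temporal NMS: suppress overlapping segments in favour of the highest score.
def temporal_nms(segments, scores, iou_thresh=0.5):
    """segments: list of (start, end); scores: list of floats. Returns kept indices."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        remaining = []
        for j in order:
            s1, e1 = segments[i]
            s2, e2 = segments[j]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            if union <= 0 or inter / union < iou_thresh:
                remaining.append(j)            # keep segments that do not overlap too much
        order = remaining
    return keep

segs = [(1.0, 5.0), (1.5, 5.5), (8.0, 12.0)]
print(temporal_nms(segs, [0.9, 0.8, 0.7]))     # -> [0, 2]
```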
The detection and recognition of distracted driving behaviors has emerged as a new vision task with the rapid development of computer vision, which is considered as a challenging temporal action localization (TAL) problem in computer vision. The primary goal of temporal localization is to determine the start and end time of actions in untrimmed videos. Currently, most state-of-the-art temporal localization methods adopt complex architectures, which are cumbersome and time-consuming. In this paper, we propose a robust and efficient two-stage framework for distracted behavior classification-localization based on the sliding window approach, which is suitable for untrimmed naturalistic driving videos. To address the issues of high similarity among different behaviors and interference from background classes, we propose a multi-view fusion and adaptive thresholding algorithm, which effectively reduces missing detections. To address the problem of fuzzy behavior boundary localization, we design a post-processing procedure that achieves fine localization from coarse localization through post connection and candidate behavior merging criteria. In the AICITY2024 Task3 TestA, our method performs well, achieving Average Intersection over Union(AIOU) of 0.6080 and ranking eighth in AICITY2024 Task3. Our code will be released in the near future.
This technical report analyzes an egocentric video action detection method we used in the 2021 EPIC-KITCHENS-100 competition hosted at the CVPR 2021 Workshop. The goal of our task is to locate the start and end times of each action in long untrimmed videos and predict the action category. We adopt a sliding window strategy to generate proposals, which can better adapt to short-duration actions. In addition, we show that classification and proposal generation conflict within the same network. Separating the two tasks boosts detection performance with high efficiency. By simply employing these strategies, we achieved 16.10% on the test set of the EPIC-KITCHENS-100 Action Detection challenge using a single model, surpassing the baseline method by 11.7% in terms of average mAP.
No abstract available
The aim of temporal action localization (TAL) is to determine the start and end frames of an action in a video. In recent years, TAL has attracted considerable attention because of its increasing applications in video understanding and retrieval. However, precisely estimating the duration of an action in the temporal dimension is still a challenging problem. In this paper, we propose an effective one‐stage TAL method based on a self‐defined motion data structure, called a dense joint motion matrix (DJMM), and a novel temporal detection strategy. Our method provides three main contributions. First, compared with mainstream motion images, DJMMs can preserve more pre‐processed motion features and provides more precise detail representations. Furthermore, DJMMs perfectly solve the temporal information loss problem caused by motion trajectory overlaps within a certain time period. Second, a spatial pyramid pooling (SPP) layer, which is widely used in the object detection and tracking fields, is innovatively incorporated into the proposed method for multi‐scale feature learning. Moreover, the SPP layer enables the backbone convolutional neural network (CNN) to receive DJMMs of any size in the temporal dimension. Third, a large‐scale‐first temporal detection strategy inspired by a well‐developed Chinese text segmentation algorithm is proposed to address long‐duration videos. Our method is evaluated on two benchmark data sets and one self‐collected data set: Florence‐3D, UTKinect‐Action3D and HanYue‐3D. The experimental results show that our method achieves competitive action recognition accuracy and high TAL precision, and its time efficiency and few‐shot learning capabilities enable it to be utilized for real‐time surveillance.
Surgical performance depends not only on surgeons’ technical skills, but also on team communication within and across the different professional groups present during the operation. Therefore, automatically identifying team communication in the OR is crucial for patient safety and advances in the development of computer-assisted surgical workflow analysis and intra-operative support systems. To take the first step, we propose a new task of detecting communication briefings involving all OR team members, i.e., the team Time-out and the StOP?-protocol, by localizing their start and end times in video recordings of surgical operations. We generate an OR dataset of real surgeries, called Team-OR, with more than one hundred hours of surgical videos captured by the multi-view camera system in the OR. The dataset contains temporal annotations of 33 Time-out and 22 StOP?-protocol activities in total. We then propose a novel group activity detection approach, where we encode both scene context and action features, and use an efficient neural network model to output the results. The experimental results on the Team-OR dataset show that our approach outperforms existing state-of-the-art temporal action detection approaches. It also demonstrates the lack of research on group activities in the OR, proving the significance of our dataset. We investigate the Team Time-Out and the StOP?-protocol in the OR, by presenting the first OR dataset with temporal annotations of group activities protocols, and introducing a novel group activity detection approach that outperforms existing approaches. Code is available at https://github.com/CAMMA-public/Team-OR.
Video activity localization aims at understanding the semantic content in long untrimmed videos and retrieving actions of interest. The retrieved action with its start and end locations can be used for highlight generation, temporal action detection, etc. Unfortunately, learning the exact boundary location of activities is highly challenging because temporal activities are continuous in time, and there are often no clear-cut transitions between actions. Moreover, the definition of the start and end of events is subjective, which may confuse the model. To alleviate the boundary ambiguity, we propose to study the video activity localization problem from a denoising perspective. Specifically, we propose an encoder-decoder model named DenoiseLoc. During training, a set of action spans is randomly generated from the ground truth with a controlled noise scale. Then we attempt to reverse this process by boundary denoising, allowing the localizer to predict activities with precise boundaries and resulting in faster convergence. Experiments show that DenoiseLoc advances several video activity understanding tasks. For example, we observe a gain of +12.36% average mAP on the QV-Highlights dataset and +1.64% mAP@0.5 on the THUMOS'14 dataset over the baseline. Moreover, DenoiseLoc achieves state-of-the-art performance on the TACoS and MAD datasets, but with much fewer predictions compared to other current methods.
Researchers in natural science need reliable methods for quantifying animal behavior. Recently, numerous computer vision methods emerged to automate the process. However, observing wild species at remote locations remains a challenging task due to difficult lighting conditions and constraints on power supply and data storage. Event cameras offer unique advantages for battery-dependent remote monitoring due to their low power consumption and high dynamic range capabilities. We use this novel sensor to quantify a behavior in Chinstrap penguins called ecstatic display. We formulate the problem as a temporal action detection task, determining the start and end times of the behavior. For this purpose, we recorded a colony of breeding penguins in Antarctica for several weeks and labeled event data on 16 nests. The developed method consists of a generator of candidate time intervals (proposals) and a classifier of the actions within them. The experiments show that the event cameras' natural response to motion is effective for continuous behavior monitoring and detection, reaching a mean average precision (mAP) of 58% (which increases to 63% in good weather conditions). The results also demonstrate the robustness against various lighting conditions contained in the challenging dataset. The low-power capabilities of the event camera allow it to record significantly longer than with a conventional camera. This work pioneers the use of event cameras for remote wildlife observation, opening new interdisciplinary opportunities. https://tub-rip.github.io/eventpenguins/
Weakly supervised video anomaly detection is an important problem in many real-world applications where during training there are some anomalous videos, in addition to nominal videos, without labelled frames to indicate when the anomaly happens. State-of-the-art methods in this domain typically focus on offline anomaly detection without any concern for real-time detection. Most of these methods rely on ad hoc feature aggregation techniques and the use of metric learning losses, which limit the ability of the models to detect anomalies in real-time. In line with the premise of deep neural networks, there also has been a growing interest in developing end-to-end approaches that can automatically learn effective features directly from the raw data. We propose the first real-time and end-to-end trained algorithm for weakly supervised video anomaly detection. Our training procedure builds upon recent action recognition literature and trains a large video model to learn visual features. This is in contrast to existing approaches which largely depend on pre-trained feature extractors. The proposed method significantly improves the anomaly detection speed and AUC performance compared to the existing methods. Specifically, on the UCF-Crime dataset, our method achieves 86.94% AUC with a decision period of 6.4 seconds while the competing methods achieve at most 85.92% AUC with a decision period of 273 seconds.
No abstract available
This report unifies the six core research directions of temporal action localization (TAL). The overall trends are as follows: technical architectures are moving from convolutional neural networks toward architectures capable of modeling long-range temporal dependencies, such as Transformers and Mamba (SSMs); supervision paradigms are evolving from full supervision, which relies heavily on frame-level annotations, toward weakly supervised, point-level supervised, and open-set/zero-shot learning to ease the annotation bottleneck; the algorithmic core still centers on fine-grained boundary modeling to improve localization accuracy; and the research scope has expanded from laboratory benchmark datasets to real-time online monitoring, multi-modal fusion, and diverse industrial application scenarios (e.g., power operations, healthcare, sports), with growing attention to computational efficiency and robustness in complex environments.