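Research on Few-Shot Human Action Recognition Based on YOLOv8 Pose Estimation and Spatio-Temporal Attention Models
Foundational Frameworks for Action Recognition Based on YOLOv8 and Pose Estimation
This group of references focuses on using advanced pose estimation models such as YOLOv8-Pose as the preprocessing core of action recognition, building a complete technical pipeline from keypoint extraction to sequence modeling.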
- Deep Learning-based Human Pose Estimation: A Survey(Ce Zheng, Wenhan Wu, Taojiannan Yang, Sijie Zhu, Chen Chen, Ruixu Liu, Ju Shen, N. Kehtarnavaz, M. Shah, 2020, ACM Computing Surveys)
- A comprehensive survey on human pose estimation approaches(Shradha Dubey, M. Dixit, 2022, Multimedia Systems)
- Overview of behavior recognition based on deep learning(Kai Hu, Junlan Jin, Fei Zheng, L. Weng, Yiwu Ding, 2022, Artificial Intelligence Review)
- Human Pose Estimation Using Deep Learning: A Systematic Literature Review(Esraa Samkari, Muhammad Arif, Manal Alghamdi, M. A. Ghamdi, 2023, Machine Learning and Knowledge Extraction)
- The Progress of Human Pose Estimation: A Survey and Taxonomy of Models Applied in 2D Human Pose Estimation(Tewodros Legesse Munea, Yalew Zelalem Jembre, Halefom Tekle Weldegebriel, Longbiao Chen, Chenxi Huang, Chenhui Yang, 2020, IEEE Access)
- Human Pose Estimation from Video and IMUs(T. V. Marcard, Gerard Pons-Moll, B. Rosenhahn, 2016, IEEE Transactions on Pattern Analysis and Machine Intelligence)
- Survey on DL-Based Object Detection and Pose Estimation for Human-Robot Collaboration Manufacturing(Imran Rashid, Junfeng Wang, Sulman Ahmed, Faheem Ahmed, 2025, SSRN Electronic Journal)
- Human Pose Estimation and Activity Recognition From Multi-View Videos: Comparative Explorations of Recent Developments(M. B. Holte, Cuong Tran, M. Trivedi, T. Moeslund, 2012, IEEE Journal of Selected Topics in Signal Processing)
- Poster: Temporal Action Recognition Combining YOLOv8-Pose and BiLSTM(X. Shangguan, Wenping Yu, Yancui Shi, Xiaoxiao Lu, 2025, 2025 IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA))
- Distracted Driving Behavior Recognition Based on Improved YOLOv8n-Pose and Multi-Feature Fusion(Zhuzhou Li, Dudu Guo, Zhenxun Wei, Guoliang Chen, Miao Sun, Yuhao Sun, 2026, Applied Sciences)
- Unifying Human Pose Estimation in the Fall Detection Problem(Egor Surkov, Oleg Seredin, A. Kopylov, O. Kushnir, 2024, Pattern Recognition and Image Analysis)
- Real-Time Classification of Operational Human Movements using YOLOv8-Pose and Feed-Forward Neural Networks(Deniz Eryılmaz, Ersin Alaybeyoğlu, 2025, 2025 9th International Symposium on Innovative Approaches in Smart Technologies (ISAS))
- Body-Pose-Guided Action Recognition with Convolutional Long Short-Term Memory (LSTM) in Aerial Videos(Sohaib Saeed, Hassan Akbar, Tahir Nawaz, H. Elahi, U. S. Khan, 2023, Applied Sciences)
- Pedestrian Crossing Intention Recognition Based on Skeleton Features [基于骨架特征的行人过街意图识别](Jushou Lu, Hao Chen, Yuchuan Bai, Chuanpeng Hu, Xi Zhang, 2024, Journal of Shanghai Jiaotong University (Science))
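Spatio-Temporal Attention Mechanisms and Dynamic Feature Modeling
This group of references focuses on the design of spatial and temporal attention modules, which aim to capture the weights of key body parts and the dynamic evolution within skeleton sequences, strengthening the model's ability to discriminate complex actions.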
- Spatial-Temporal Hypergraph Based on Dual-Stage Attention Network for Multi-View Data Lightweight Action Recognition(Zhixuan Wu, Nan Ma, Cheng Wang, Cheng Xu, Genbao Xu, Mingxing Li, 2023, Pattern Recognition)
- Human action recognition using attention based LSTM network with dilated CNN features(Khan Muhammad, Mustaqeem, Amin Ullah, Ali Shariq Imran, M. Sajjad, M. S. Kıran, Giovanna Sannino, V. H. Albuquerque, 2021, Future Generation Computer Systems)
- Spatio-temporal hard attention learning for skeleton-based activity recognition(Bahareh Nikpour, N. Armanfard, 2023, Pattern Recognition)
- Decoupled Spatial-Temporal Attention Network for Skeleton-Based Action-Gesture Recognition(Lei Shi, Yifan Zhang, Jian Cheng, Hanqing Lu, 2020, Lecture Notes in Computer Science)
- Global Spatio-Temporal Attention for Action Recognition Based on 3D Human Skeleton Data(Yun Han, Sheng-Luen Chung, Qian Xiao, W. Lin, S. Su, 2020, IEEE Access)
- STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition(Dasom Ahn, Sangwon Kim, H. Hong, ByoungChul Ko, 2022, 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))
- Hierarchical Spatial–Temporal Window Transformer for Pose-Based Rodent Behavior Recognition(Zhihao Ru, Feng Duan, 2024, IEEE Transactions on Instrumentation and Measurement)
- Spatial–temporal graph attention networks for skeleton-based action recognition(Q Huang, F Zhou, J He, Y Zhao, 2020, Journal of Electronic …)
- Lightweight Semantic-Guided Neural Networks Based on Single Head Attention for Action Recognition(Seon-Bin Kim, Chanhyuk Jung, Byeong-il Kim, ByoungChul Ko, 2022, Sensors)
- Spatial–Temporal Dynamic Graph Attention Network for Skeleton-Based Action Recognition(Mrugendrasinh L. Rahevar, A. Ganatra, T. Saba, A. Rehman, Saeed Ali Omer Bahaj, 2023, IEEE Access)
- Robust Human Action Recognition Using Global Spatial-Temporal Attention for Human Skeleton Data(Yun Han, Sheng-Luen Chung, Arulmurugan Ambikapathi, Jui-Shan Chan, W. Lin, S. Su, 2018, 2018 International Joint Conference on Neural Networks (IJCNN))
- Spatio-temporal segments attention for skeleton-based action recognition(Helei Qiu, B. Hou, Bo Ren, Xiaohua Zhang, 2022, Neurocomputing)
- LST-AGCN: A Novel Unified Lightweight Attention Framework for Efficient Skeleton-Based Action Recognition(Khadija Lasri, Khalid El Fazazy, A. M. Mahraz, Hamid Tairi, J. Riffi, 2026, Big Data and Cognitive Computing)
- LAGA-Net: Local-and-Global Attention Network for Skeleton Based Action Recognition(Rongjie Xia, Yanshan Li, Wenhan Luo, 2022, IEEE Transactions on Multimedia)
- Spatio-Temporal Attention-Based LSTM Networks for 3D Action Recognition and Detection(Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu, 2018, IEEE Transactions on Image Processing)
- Skeleton-based Human Action Recognition via Large-kernel Attention Graph Convolutional Network(Yanan Liu, Hao Zhang, Yanqiu Li, Kangjian He, Dan Xu, 2023, IEEE Transactions on Visualization and Computer Graphics)
- Weakly-supervised temporal attention 3D network for human action recognition(Jong-Han Kim, Gen Li, Inyong Yun, Cheolkon Jung, Joongkyu Kim, 2021, Pattern Recognition)
- An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition(Chenyang Si, Wentao Chen, Wei Wang, Liang Wang, T. Tan, 2019, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Self-Attention Network for Skeleton-based Human Action Recognition(Sangwoo Cho, M. H. Maqbool, Fei Liu, H. Foroosh, 2019, 2020 IEEE Winter Conference on Applications of Computer Vision (WACV))
- Spatio-temporal attention on manifold space for 3D human action recognition(Chongyang Ding, Kai Liu, Fei Cheng, E. Belyaev, 2020, Applied Intelligence)
- Towards To-a-T Spatio-Temporal Focus for Skeleton-Based Action Recognition(Lipeng Ke, Kuan-Chuan Peng, Siwei Lyu, 2022, Proceedings of the AAAI Conference on Artificial Intelligence)
- An End-to-End Spatio-Temporal Attention Model for Human Action Recognition from Skeleton Data(Sijie Song, Cuiling Lan, Junliang Xing, Wenjun Zeng, Jiaying Liu, 2016, Proceedings of the AAAI Conference on Artificial Intelligence)
- Lightweight graph convolutional network with multi-attention mechanisms for intelligent action recognition in online physical education(Yuhao You, 2025, PeerJ Computer Science)
- Human skeleton-based action recognition algorithm based on spatiotemporal attention graph convolutional network model(Y. Li, J. Yuan, H. Liu, 2021, Journal of Computer Applications (计算机应用))
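Few-Shot Learning and Generalization Optimization Strategies
This group of references addresses the challenge of scarce labeled samples, exploring strategies such as data augmentation, regularization, meta-learning, and generative models to improve generalization under few-shot conditions.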
- Skeleton-Based Few-Shot Action Recognition via Fine-Grained Information Capture and Adaptive Metric Aggregation(Jingyun Tian, Jinjing Gu, Yuanyuan Pu, Zhengpeng Zhao, 2025, IEEE Transactions on Instrumentation and Measurement)
- Few‐shot Learning of Homogeneous Human Locomotion Styles(I. Mason, S. Starke, He Zhang, Hakan Bilen, T. Komura, 2018, Computer Graphics Forum)
- Learning Spatial-Preserved Skeleton Representations for Few-Shot Action Recognition(Ning Ma, Hongyi Zhang, Xuhui Li, Sheng Zhou, Zhen Zhang, Jun Wen, Haifeng Li, Jingjun Gu, Jiajun Bu, 2022, Lecture Notes in Computer Science)
- Benchmarking Federated Few-Shot Learning for Video-Based Action Recognition(Nguyen Anh Tu, Nartay Aikyn, Nursultan Makhanov, Assanali Abu, Kok-Seng Wong, Min-Ho Lee, 2024, IEEE Access)
- Generative Action Description Prompts for Skeleton-based Action Recognition(Wangmeng Xiang, C. Li, Yuxuan Zhou, Biao Wang, Lei Zhang, 2022, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- SMAM: Self and Mutual Adaptive Matching for Skeleton-Based Few-Shot Action Recognition(Zhiheng Li, Xuyuan Gong, Ran Song, Peng Duan, Jun Liu, Wei Zhang, 2022, IEEE Transactions on Image Processing)
- Semantic-guided Cross-Modal Prompt Learning for Skeleton-based Zero-shot Action Recognition(Anqi Zhu, Jingmin Zhu, James Bailey, Mingming Gong, Qiuhong Ke, 2025, 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Parallel Attention Interaction Network for Few-Shot Skeleton-based Action Recognition(Xingyu Liu, Sanpin Zhou, Le Wang, Gang Hua, 2023, 2023 IEEE/CVF International Conference on Computer Vision (ICCV))
- Cross-Modal Contrastive Pre-Training for Few-Shot Skeleton Action Recognition(MingQi Lu, Siyuan Yang, Xiaobo Lu, Jun Liu, 2024, IEEE Transactions on Circuits and Systems for Video Technology)
- Enhancing Few-Shot Action Recognition Using Skeleton Temporal Alignment and Adversarial Training(Qingyang Xu, Jianjun Yang, Hongyi Zhang, Xin Jie, Danushka Bandara, 2024, IEEE Access)
- A Systematic Review of Skeleton-Based Action Recognition: Methods, Challenges, and Future Directions(Yi Liu, Ruyi Liu, Yuzhi Hu, Mengyao Wu, Wentian Xin, Qiguang Miao, Shuai Wu, Long Li, 2025, IEEE Transactions on Neural Networks and Learning Systems)
- Few-shot generative model for skeleton-based human action synthesis using cross-domain adversarial learning(Kenichiro Fukushi, Y. Nozaki, Kosuke Nishihara, Kentaro Nakahara, 2024, 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV))
- Leveraging Enriched Skeleton Representation With Multi-Relational Metrics for Few-Shot Action Recognition(Jingyun Tian, Jinjing Gu, Yuanyuan Pu, Zhengpeng Zhao, 2025, IEEE Transactions on Multimedia)
- Temporal-Viewpoint Transportation Plan for Skeletal Few-Shot Action Recognition(Lei Wang, Piotr Koniusz, 2023, Lecture Notes in Computer Science)
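Lightweight Model Design and Practical Application Systems
This group of references concerns lightweight model design for resource-constrained scenarios, together with the integration of action recognition into practical systems for security, rehabilitation, sports analysis, and similar applications.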
- A Lightweight approach to human action recognition based on MotionBERT(QiHe Fang, Jin Qin, Huibin Qin, Hongshuai Qin, 2024, Proceedings of the 2024 8th International Conference on Advances in Artificial Intelligence)
- Pose-based Human Behavior Detection for Real-time Security Surveillance(Jua Park, Jeongin Cho, Soonchan Park, J. S. Lee, Moonwook Ryu, 2025, 2025 16th International Conference on Information and Communication Technology Convergence (ICTC))
- A Lightweight Skeleton-Based 3D-CNN for Real-Time Fall Detection and Action Recognition(Nadhira Noor, In Kyu Park, 2023, 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW))
- Skeletal Keypoint-Based Transformer Model for Human Action Recognition in Aerial Videos(Shahab Uddin, Tahir Nawaz, James Ferryman, N. Rashid, M. Asaduzzaman, R. Nawaz, 2024, IEEE Access)
- MCANet: a lightweight action recognition network with multidimensional convolution and attention(Qiuhong Tian, Weilun Miao, Lizao Zhang, Ziyu Yang, Yang Yu, Yanying Zhao, Lan Yao, 2024, International Journal of Machine Learning and Cybernetics)
- Real-Time Fall Detection in Clinical and Home Environments Using YOLO-Based Pose Estimation and Spatio-Temporal Skeletal Features(Houssein Taleb, Mostafa Rizk, Chamseddine Zaki, Jad Abou Chaaya, Abbass Nasser, 2026, Research Square)
- Real-Time Student Behavior Recognition in Classroom Using Pose Estimation and Gaze Analysis(Huu‐Huy Ngo, Linh Le, Nguyen Duy Minh, Duc-Tuong Duong, Tien-Khai Vu, 2026, Lecture Notes in Networks and Systems)
- Deep learning-based control system for context-aware surveillance using skeleton sequences from IP and drone camera video(Vasavi Sanikommu, Sobhana Mummaneni, Novaline Jacob, Emmanuel K.C, B. Kumar, Radha Variam, 2025, The International Arab Journal of Information Technology)
- P-CNN: Pose-Based CNN Features for Action Recognition(Guilhem Chéron, I. Laptev, C. Schmid, 2015, 2015 IEEE International Conference on Computer Vision (ICCV))
- Skeleton-Based Posture Estimation for Human Action Recognition Using Deep Learning(Minh-Trieu Truong, Van-Dung Hoang, Thi-Minh-Chau Le, 2024, Lecture Notes in Networks and Systems)
- Student Behavior Recognition System for the Classroom Environment Based on Skeleton Pose Estimation and Person Detection(Feng-Cheng Lin, Huu-Huy Ngo, C. Dow, Ka-Hou Lam, Hung Linh Le, 2021, Sensors)
- Recognizing Human Actions as the Evolution of Pose Estimation Maps(Mengyuan Liu, Junsong Yuan, 2018, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition)
- AI Enabled Human Action Recognition(Saloni Sah, Suchit Purohit, 2025, Communications in Computer and Information Science)
- Human Posture Estimation and Sustainable Events Classification via Pseudo-2D Stick Model and K-ary Tree Hashing(Ahmad Jalal, Israr Akhtar, Kibum Kim, 2020, Sustainability)
- Real-Time Human Pose Detection and Recognition Using MediaPipe(Amrita Singh, Vedant Arvind Kumbhare, K. Arthi, 2022, Advances in Intelligent Systems and Computing)
- Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning(Xinshun Wang, Zhongbin Fang, Xia Li, Xiangtai Li, Chen Chen, Mengyuan Liu, 2023, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))
- Advanced Human Pose Estimation and Event Classification Using Context-Aware Features and XGBoost Classifier(Wasim Wahid, A. Alarfaj, Abtisam Ebdullah Alabdulqader, Touseef Sadiq, Hameedur Rahman, Ahmad Jalal, 2024, IEEE Access)
- Video-based Fall Detection for Seniors with Human Pose Estimation(Zhanyuan Huang, Yang Liu, Yajun Fang, B. Horn, 2018, 2018 4th International Conference on Universal Village (UV))
- Human Pose Estimation and Activity Classification Using Machine Learning Approach(J. Arunnehru, A. Davi, R. Sharan, Poornima G. Nambiar, 2019, Advances in Intelligent Systems and Computing)
- Skeleton-based multi-person action recognition towards real-world violence detection(Minh Q. Truong, Van-Dung Hoang, 2025, Engineering Applications of Artificial Intelligence)
- Human Activity Recognition System Based on Feature Fusion and Lightweight Spatiotemporal Attention Model(C. Wong, Huaching Chen, Hsuan-Ming Feng, 2026, Cybernetics and Systems)
- Interpretable Classification of Human Exercise Videos Through Pose Estimation and Multivariate Time Series Analysis(Ashish Singh, Binh Thanh Le, Thach Le Nguyen, Darragh Whelan, Martin O’Reilly, Brian Caulfield, Georgiana Ifrim, 2022, Studies in Computational Intelligence)
- Exploring real-time action recognition for collapse detection in low-density areas: a YOLOv8-Pose and multi-LISAAL(TAC Rendon, ALM Morales, 2025, Journal of Physics …)
- Research on Tennis Pose-based Action Recognition and Evaluation Based on Skeletal Key Points(He Zhang, Jianrui Fu, Juan Wang, 2026, 2026 9th International Symposium on Big Data and Applied Statistics (ISBDAS))
- Teaching behaviors recognition by combining deep learning-based human body detection and pose estimation(Yulu Peng, Shenglian Lu, Zhiliang Qiu, Jijie Wang, 2024, Proceedings of the 2024 International Symposium on Artificial Intelligence for Education)
- Self-Supervised Skeleton-Based Action Representation Learning: A Benchmark and Beyond(Jiahang Zhang, Lilang Lin, Shuai Yang, Jiaying Liu, 2024, International Journal of Computer Vision)
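This report organizes research on human action recognition into four core dimensions: YOLOv8-based pose estimation frameworks; spatio-temporal attention mechanisms and feature modeling; few-shot learning and generalization strategies; and lightweight models and practical application systems. The taxonomy covers the full pipeline from keypoint extraction and dynamic feature enhancement to few-shot training and engineering deployment, providing theoretical and technical support for human action detection under few-shot conditions.
78 related references in total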
… Another model we proposed in this task is using YOLOv8 Pose for … The YOLOv8 Pose model is pretrained on the COCO dataset… However, for the YOLOv8 Pose model, the number of …
Several efforts have been made to develop effective and robust vision-based solutions for human action recognition in aerial videos. Generally, the existing methods rely on the extraction of either spatial features (patch-based methods) or skeletal key points (pose-based methods) that are fed to a classifier. Unlike the patch-based methods, the pose-based methods are generally regarded to be more robust to background changes and computationally efficient. Moreover, at the classification stage, the use of deep networks has generated significant interest within the community; however, the need remains to develop accurate and computationally effective deep learning-based solutions. To this end, this paper proposes a lightweight Transformer network-based method for human action recognition in aerial videos using the skeletal keypoints extracted using YOLOv8. The effectiveness of the proposed method is shown on a well-known public dataset containing 13 action classes, achieving very encouraging performance in terms of accuracy and computational cost as compared to several existing related methods.
Implementing skeleton-based action recognition in real-world applications is a difficult task, because it involves multiple modules such as person detection and pose estimation. In terms of context, the skeleton-based approach has the strong advantage of robustness in understanding actual human actions. However, for most real-world videos in the standard benchmark datasets, human poses are not easy to detect (i.e., only partially visible or occluded by other objects), and existing pose estimators mostly fail to detect the person during the falling motion. Thus, we propose a newly augmented human pose dataset to improve the accuracy of pose extraction. Furthermore, we propose a lightweight skeleton-based 3D-CNN action recognition network that shows significant improvement in accuracy and processing time over the baseline. Experimental results show that the proposed skeleton-based method achieves high accuracy and efficiency in real-world scenarios.
… An end-to-end hierarchical recurrent neural network (RNN) is proposed for skeleton-based action recognition to effectively model both the spatial structure and temporal dynamics of …
… Finally, action identification aligns with their respective labels, utilizing a sigmoid function for … Skeleton-based Action Recognition (SAR) Model Skeleton-based human action recognition …
Human action recognition is crucial in intelligent systems like robotics, healthcare, and surveillance, gaining significant attention with the rise of AI in computer vision. This work presents …
This paper presents the technical framework of an integrated video analysis platform designed for the automated, real-time detection and classification of human movements. The system leverages the YOLOv8l-pose model for robust 2D human pose estimation, extracting critical upper-body keypoints (shoulders, elbows, wrists) from video streams. Subsequently, a custom-designed feed-forward neural network (specifically, a Multi-Layer Perceptron, MLP) classifies motion patterns based on features engineered from these keypoints. The feature engineering process incorporates normalization relative to body landmarks (inter-shoulder distance and body center) and the computation of relative angles between limb segments, providing pose invariance and discriminative power. The platform employs a modular pipeline encompassing video preprocessing (frame sampling at a target FPS), keypoint extraction with confidence thresholding, feature engineering, and model inference. This structured approach enables the system to accurately classify distinct operational movements, such as ‘picking/placing’ and ‘planting’ actions, as demonstrated on custom datasets. While the core components focus on single-stream processing and model training, the platform’s modular design supports potential extensions towards distributed architectures for scalable processing of data from multiple camera sources. The developed system holds significant potential for applications in industrial automation, human-robot collaboration, ergonomic assessments, and workplace safety monitoring.
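The feature engineering described in this abstract (normalization by inter-shoulder distance and body center, plus relative limb angles) can be illustrated with a minimal sketch. Everything below is an assumption for illustration rather than the authors' implementation: the COCO-17 keypoint indices, the use of the shoulder midpoint as the "body center", the choice of elbow angles as the limb angles, and the function names `normalize_keypoints`, `joint_angle`, and `upper_body_features`.

```python
import math

# Assumed COCO-17 keypoint indices (the layout used by COCO-pretrained pose models).
L_SHOULDER, R_SHOULDER = 5, 6
L_ELBOW, R_ELBOW = 7, 8
L_WRIST, R_WRIST = 9, 10


def normalize_keypoints(kpts):
    """Center keypoints on the shoulder midpoint and scale by shoulder width.

    kpts: list of (x, y) tuples indexed by keypoint id.
    The shoulder midpoint stands in for the "body center" (an assumption).
    """
    (lx, ly), (rx, ry) = kpts[L_SHOULDER], kpts[R_SHOULDER]
    cx, cy = (lx + rx) / 2.0, (ly + ry) / 2.0
    scale = math.hypot(lx - rx, ly - ry) or 1.0  # fall back to 1.0 if shoulders coincide
    return [((x - cx) / scale, (y - cy) / scale) for x, y in kpts]


def joint_angle(a, b, c):
    """Angle in degrees at vertex b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        return 0.0
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def upper_body_features(kpts):
    """Six normalized upper-body keypoints (12 values) followed by two elbow angles."""
    norm = normalize_keypoints(kpts)
    feats = []
    for idx in (L_SHOULDER, R_SHOULDER, L_ELBOW, R_ELBOW, L_WRIST, R_WRIST):
        feats.extend(norm[idx])
    feats.append(joint_angle(kpts[L_SHOULDER], kpts[L_ELBOW], kpts[L_WRIST]))
    feats.append(joint_angle(kpts[R_SHOULDER], kpts[R_ELBOW], kpts[R_WRIST]))
    return feats
```

Dividing by shoulder width makes the coordinates roughly independent of how far the person stands from the camera, and angles are unaffected by translation and scale, which is one plausible reading of the "pose invariance" the abstract mentions.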
Human Activity Recognition (HAR) combined with face recognition is set to play a decisive role in next-generation surveillance systems. This work presents a hybrid methodology that integrates deep learning and machine learning models for recognizing multi-person activities and faces. The work is structured into two parts: face recognition and human activity recognition. For face recognition, faces are detected using the state-of-the-art Multi-Task Cascaded Convolutional Neural Network (MTCNN) model, followed by key point extraction with the FaceNet model. The extracted embeddings are classified using a Support Vector Machine (SVM) to identify individuals. The SVM model achieved a classification accuracy of 0.99. For activity recognition, an ensemble model is employed to classify six activities: walking, standing, sitting, punching, kicking, and crawling. The YOLOv8 large pose model is used to extract human skeletons, which are then fed into the ensemble machine learning model for classification. This integrated system demonstrates promising performance for real-time surveillance applications that detect and recognize multi-person activity and track each person. Generating the summary report is one of the most important phases of this work: the location details of a person are stored along with the activity being performed by that person. If abnormal activity is recorded, the system generates an early warning to support better surveillance.
… papers on skeleton-based human action recognition. … The skeleton-based human action recognition approach has … YOLOv8-pose builds a 2D skeleton model of a human figure (Fig…
The accurate detection and recognition of human actions play a pivotal role in aerial surveillance, enabling the identification of potential threats and suspicious behavior. Several approaches have been presented to address this problem, but devising an accurate and robust solution remains a challenge. To this end, this paper presents an effective action recognition framework for aerial surveillance, employing the YOLOv8-Pose keypoint extraction algorithm and a customized sequential ConvLSTM (Convolutional Long Short-Term Memory) model for classifying the action. We performed a detailed experimental evaluation on the publicly available Drone Action dataset, and the comparison of the proposed framework with several existing approaches demonstrates its effectiveness. The overall accuracy of the framework on the three provided dataset splits is 74%, 80%, and 70%, with a mean accuracy of 74.67%. The proposed system effectively captures the spatial and temporal dynamics of human actions, providing a robust solution for aerial action recognition.
Falls have continued to pose a significant risk, particularly for the elderly. Preventing injuries and fatalities has required accurate and timely detection. However, the complexity of real-world environments and the need for precision have presented ongoing challenges to existing fall detection systems. While wearable sensors have proven useful, they are often uncomfortable for continuous use, and traditional detection methods have demonstrated unreliability due to their sensitivity to environmental conditions. Consequently, the development of a more accurate, real-time, non-invasive, and environment-independent detection approach has become essential. In this study, we have developed and evaluated two novel vision-based fall detection systems. In the first system, we have employed You Only Look Once, version 8 (YOLOv8) or YOLOv11 for real-time detection of both the person and the bed within each video frame. Subsequently, we have applied AlphaPose to extract human body keypoints, followed by action recognition using Spatial-Temporal Graph Convolutional Networks (ST-GCN). A custom fall detection logic has been integrated, which evaluates both posture and spatial position relative to the bed to confirm fall events. In the second system, we have utilized pose-based models (YOLOv8-pose or YOLOv11-pose) that simultaneously detect the person and estimate keypoints. Based on this data, we have designed an independent fall logic that classifies fall events through posture and location analysis. This system has also incorporated a real-time alert mechanism that sends WhatsApp notifications to enable immediate response in the event of a fall. Experimental results have demonstrated that both systems offer robust and reliable fall detection across various scenarios, significantly enhancing safety and supporting the well-being of individuals at risk.
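The abstract does not spell out its fall logic, so the following is only a hypothetical sketch of what a rule combining posture and position relative to the bed could look like. The torso-angle criterion, the 60-degree threshold, the use of the hip midpoint as the person's location, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def torso_angle_from_vertical(shoulder_mid, hip_mid):
    """Angle in degrees between the hip->shoulder vector and the image vertical axis.

    Returns 0 for an upright torso and 90 for a horizontal one.
    """
    dx = shoulder_mid[0] - hip_mid[0]
    dy = shoulder_mid[1] - hip_mid[1]
    if dx == 0 and dy == 0:
        return 0.0
    return math.degrees(math.atan2(abs(dx), abs(dy)))


def point_in_box(point, box):
    """box is (x_min, y_min, x_max, y_max) in pixels."""
    x, y = point
    return box[0] <= x <= box[2] and box[1] <= y <= box[3]


def is_fall(shoulder_mid, hip_mid, bed_box, angle_threshold=60.0):
    """Hypothetical rule: torso close to horizontal AND hips outside the bed region."""
    lying = torso_angle_from_vertical(shoulder_mid, hip_mid) >= angle_threshold
    on_bed = point_in_box(hip_mid, bed_box)
    return lying and not on_bed
```

The position check is what separates a person lying on the floor from a person lying in bed, which a posture test alone cannot distinguish. A practical system would likely also require the condition to persist over several frames before raising an alert.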
… This study utilizes YOLOv8, YOLOv8track[11], and YOLOv8-pose for pedestrian detection, … action recognition and behavior analysis. The dataset contains video clips of 9 different action …
To address the challenge of balancing recognition speed and accuracy in action recognition, this paper proposes an integrated framework that combines lightweight keypoint detection with bidirectional temporal modeling. Specifically, YOLOv8-Pose is employed to extract 17-point pose features with normalized coordinates for improved generalization, while a BiLSTM network captures long-range temporal dependencies. Experiments on the UTD-MHAD dataset show that the proposed YOLOv8m-Pose+BiLSTM method achieves 82.16% accuracy at 42 FPS, demonstrating a practical balance of efficiency and performance.
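As a rough illustration of the preprocessing implied here, the sketch below turns per-frame 17-point keypoints into normalized, fixed-length sequences that a temporal model such as a BiLSTM could consume. The abstract does not say how coordinates are normalized or how sequence length is fixed, so dividing by image width and height, uniform temporal sampling, last-frame padding, and the names `normalize_frame` and `to_fixed_length` are all assumptions. The BiLSTM itself is not shown.

```python
def normalize_frame(kpts_px, width, height):
    """Scale pixel keypoints to [0, 1] and flatten to [x0, y0, x1, y1, ...].

    For 17 keypoints this yields a 34-dimensional vector per frame.
    """
    flat = []
    for x, y in kpts_px:
        flat.extend((x / width, y / height))
    return flat


def to_fixed_length(frames, seq_len):
    """Return exactly seq_len frames by uniform sampling or last-frame padding."""
    if not frames:
        raise ValueError("frames must not be empty")
    n = len(frames)
    if n >= seq_len:
        if seq_len == 1:
            return [frames[0]]
        # Uniform temporal sampling that keeps the first and last frame.
        return [frames[round(i * (n - 1) / (seq_len - 1))] for i in range(seq_len)]
    # Sequence too short: pad by repeating the last frame.
    return frames + [frames[-1]] * (seq_len - n)
```

Normalizing by image size removes the dependence on camera resolution, which is one plausible source of the improved generalization the abstract attributes to normalized coordinates.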
… YOLOv8-Pose, known for their real-time detection capabilities, have evolved to simultaneously recognize … for real-time skeleton based human action recognition and progress prediction. …
Few-shot human action recognition, a prominent area in computer vision, has garnered increasing attention and broader use in real-life scenarios. Extracting spatio-temporal skeletal information from human movement videos offers interpretable and data-efficient features. However, existing spatio-temporal feature encoders face challenges such as handling sequence boundaries and coping with noise. To solve these problems, this paper proposes a temporal complement method to optimize the Dynamic Time Warping (DTW) algorithm based on the feature representation of the human skeleton sequence. DTW helps to find the optimal alignment between sequences by warping them in the time domain. This is especially useful in scenarios where training data is limited. However, DTW has the drawback that the optimal alignment path is highly sensitive to errors in the time series distance matrix. Therefore, we apply a Virtual Adversarial Training method to improve the anti-noise capability of the algorithm. Here, we introduce adversarial perturbations to the time series distance matrix in the training phase, thus incentivizing the model to be resilient to such noise. Our method achieves the highest accuracy among the ProtoNet, DTW, and DASTM methods in the 5-way-1-shot setting on the NTU-S (77.7%) and Kinetics (41.2%) datasets. In the 5-way-5-shot setting, our method achieves the highest accuracy of 51.8% on the Kinetics dataset when compared with the other approaches.
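For readers unfamiliar with DTW, the sketch below shows the classic dynamic-programming formulation that the abstract builds on. It is plain DTW only: the paper's temporal complement and Virtual Adversarial Training are not reproduced, and the Euclidean frame distance and the function names are choices made here for illustration.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two flattened pose vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Classic DTW: minimal accumulated frame distance over all monotonic alignments."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Because every entry of the cost table depends on the pairwise frame distances, a single noisy distance can redirect the whole alignment path, which is the sensitivity the abstract sets out to reduce.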
… joints annotations for human action recognition tasks, containing 113,945 skeleton sequences with 25 body joints for each skeleton. In our experiments, we use 120 action categories, …
We propose few-shot generative models of skeleton-based human actions on limited samples of the target domain. We exploit large public datasets as a source of motion variations by introducing novel cross-domain and entropy regularization losses that effectively transfer the diversity of the motions contained in the source to the target domain. First, target samples are divided into patches, which are a set of short motion clips. For each patch, we search for a reference motion from the source dataset that is similar to the patch. Next, in adversarial training, our cross-domain regularization encourages the generated sequences to resemble the reference motion at the patch level. Entropy regularization prevents mode collapse by forcing the generator to follow the distribution of the source dataset. Experiments are performed on public datasets where we utilize three action classes from NTU RGB+D 120 as the target and all data of 60 action classes in NTU RGB+D as the source. Ten samples for each target action class, 30 in total, are selected as target data. The results demonstrate that data augmented with the proposed method improve recognition accuracy by 28% using an ST-GCN classifier.
This paper focuses on skeleton-based few-shot action recognition. Since a skeleton is essentially a sparse representation of human action, the feature maps extracted from it through a standard encoder network in the few-shot condition may not be sufficiently discriminative for action sequences that look partially similar to each other. To address this issue, we propose a self and mutual adaptive matching (SMAM) module to convert such feature maps into more discriminative feature vectors. Our method, named SMAM-Net, first leverages both the temporal information associated with each individual skeleton joint and the spatial relationship among them for feature extraction. Then, the SMAM module adaptively measures the similarity between labeled and query samples and further carries out feature matching within the query set to distinguish similar skeletons of various action categories. Experimental results show that SMAM-Net outperforms other baselines on the large-scale NTU RGB+D 120 dataset in the tasks of one-shot and five-shot action recognition. We also report our results on smaller datasets including NTU RGB+D 60, SYSU and PKU-MMD to demonstrate that our method is reliable and generalises well on different datasets. Codes and the pretrained SMAM-Net will be made publicly available.
This paper proposes a novel approach for few-shot skeleton action recognition that comprises two stages: cross-modal pre-training of a skeleton encoder, followed by fine-tuning of a cosine classifier on the support set. The pre-training and fine-tuning approach has been demonstrated to be more effective for handling few-shot tasks than more intricate meta-learning methods. However, its success relies on the availability of a large-scale training dataset, which is difficult to obtain. To address this challenge, we introduce a cross-modal pre-training framework based on Bootstrap Your Own Latent (BYOL), which considers skeleton sequences and their corresponding videos as augmented views of the same action in different modalities. By utilizing a simple regression loss, the framework is able to transfer robust and high-quality vision-language representations to the skeleton encoder. This allows the skeleton encoder to gain a comprehensive understanding of action sequences and benefit from the prior knowledge obtained from a vision-language pre-trained model. The representation transfer enhances the feature extraction capability of the skeleton encoder, compensating for the lack of large-scale skeleton datasets. Extensive experiments on the NTU RGB+D, NTU RGB+D 120, PKU-MMD, NW-UCLA, and MSR Action Pairs datasets demonstrate that our proposed approach achieves state-of-the-art performance for few-shot skeleton action recognition.
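To make the N-way K-shot setting and the idea of a cosine classifier over a support set concrete, here is a minimal sketch. It is a simplification and not the paper's method: instead of fine-tuning classifier weights, it uses the mean support embedding of each class as a fixed prototype and assigns the query to the prototype with the highest cosine similarity. The embeddings are assumed to come from some pre-trained skeleton encoder that is not shown, and the function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 0.0
    return dot / (na * nb)


def classify_episode(support, query):
    """support: dict mapping class label -> list of K embeddings. Returns the best label."""
    # One prototype per class: the mean of its K support embeddings.
    prototypes = {label: mean_vector(embs) for label, embs in support.items()}
    return max(prototypes, key=lambda label: cosine_similarity(prototypes[label], query))
```

With two classes and two shots each, for example, `classify_episode({"wave": [[1.0, 0.0], [0.9, 0.1]], "kick": [[0.0, 1.0], [0.1, 0.9]]}, [0.8, 0.2])` selects the class whose prototype points in the direction closest to the query.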
Recognizing novel actions based on a few labeled skeleton sequence samples is a promising field. Existing works primarily focus on global representations of similar actions and designing novel metric functions. This approach overlooks fine-grained and unique characteristics within samples, which are crucial for distinguishing subtle differences between actions. Moreover, as the field advances, designing increasingly complex metric functions yields diminishing returns in performance improvements. To address these issues, we propose the fine-grained information capture and adaptive metric aggregation (FICAMA) framework for skeleton-based few-shot action recognition (FSAR). This framework enhances the self-information of support samples to facilitate the capture of fine-grained representations. The combination of fine-grained representations and coarse-grained global representations contributes to better action matching. We then employ the skeletal motion fusion module to spatially combine coarse-grained and fine-grained representations within samples and introduce temporal context information through positional encoding. In addition, we design an adaptive multimetric distance aggregation module (AMA) capable of simultaneously utilizing multiple metric functions to aggregate the results of different metric functions in a task-adaptive manner and achieve high-precision skeleton sequence matching. Experiments on the NTU RGB+D 120 and Kinetics datasets demonstrate the effectiveness of our approach. Code is available at https://github.com/jinjinggu00/FICAMA.
Skeleton-based action recognition has recently received considerable attention. Current approaches to skeleton-based action recognition are typically formulated as one-hot classification tasks and do not fully exploit the semantic relations between actions. For example, "make victory sign" and "thumb up" are both hand gestures whose major difference lies in the movement of the hands. This information is absent from the categorical one-hot encoding of action classes but can be recovered from the action description. Therefore, utilizing action descriptions in training could potentially benefit representation learning. In this work, we propose a Generative Action-description Prompts (GAP) approach for skeleton-based action recognition. More specifically, we employ a pre-trained large-scale language model as the knowledge engine to automatically generate text descriptions of the body-part movements of actions, and propose a multi-modal training scheme that uses the text encoder to generate feature vectors for different body parts and supervise the skeleton encoder for action representation learning. Experiments show that our proposed GAP method achieves noticeable improvements over various baseline models without extra computation cost at inference. GAP achieves new state-of-the-art results on popular skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120 and NW-UCLA. The source code is available at https://github.com/MartinXM/GAP.
Learning discriminative features from very few labeled samples to identify novel classes has received increasing attention in skeleton-based action recognition. Existing works aim to learn action-specific embeddings by exploiting either intra-skeleton or inter-skeleton spatial associations, which may lead to less discriminative representations. To address these issues, we propose a novel Parallel Attention Interaction Network (PAINet) that incorporates two complementary branches to strengthen the match through inter-skeleton and intra-skeleton correlation. Specifically, a topology encoding module utilizing topology and physical information is proposed to enhance the modeling of interactive parts and joint pairs in both branches. In the Cross Spatial Alignment branch, we employ a spatial cross-attention module to establish joint associations across sequences, and a directional Average Symmetric Surface Metric is introduced to locate the closest temporal similarity. In parallel, the Cross Temporal Alignment branch incorporates a spatial self-attention module to aggregate spatial context within sequences and applies a temporal cross-attention network to correct temporal misalignment and calculate similarity. Extensive experiments on three skeleton benchmarks, namely NTU-T, NTU-S, and Kinetics, demonstrate the superiority of our framework, which consistently outperforms state-of-the-art methods.
A Systematic Review of Skeleton-Based Action Recognition: Methods, Challenges, and Future Directions
Human action recognition (HAR), which aims to recognize and understand individual actions and intentions, has rapidly become a research hotspot in computer vision. Compared with other data modalities, skeleton data offers more efficient node semantics and more coherent spatio-temporal motion patterns, effectively reducing the impact of lighting and background changes. In recent years, many researchers have focused on skeleton-based action recognition methods and have made significant progress. However, we believe that the current skeleton-based action recognition methods still face three major challenges: 1) reducing reliance on expensive labeled data while maintaining model performance; 2) enabling the model to understand and recognize new behavior classes with a limited number of samples; and 3) addressing the challenges posed by the lack of skeleton information in single-modality spatio-temporal motion representation learning. Based on these challenges, we conduct a comprehensive review of the existing skeleton-based action recognition methods. Additionally, we provide an extensive review and analysis of publicly available action recognition datasets. This review aims to offer researchers a comprehensive perspective, stimulate more innovative ideas, and promote the application and breakthrough of skeleton action recognition in a wider range of computer vision tasks.
Few-shot action recognition aims to identify new action classes with limited training samples. Most existing methods overlook the low information content and diversity of skeleton features, failing to exploit useful information in rare samples during meta-training. This leads to poor feature discriminability and recognition accuracy. To address both issues, we propose a novel Enriched Skeleton Representation and Multi-relational Metrics (ESR-MM) method for skeleton-based few-shot action recognition. First, a Frobenius Norm Diversity Loss is introduced to enrich skeleton representation by maximizing the Frobenius norm of the skeleton feature matrix. This mitigates over-smoothing and boosts information content and diversity. Leveraging these enriched features, we propose a multi-relational metrics strategy exploiting cross-sample task-specific information, intra-sample temporal order, and inter-sample distance. Specifically, Support-Adaptive Attention leverages task-specific cues between samples to generate attention-enhanced features. Then, the Bidirectional Temporal Coherent Mean Hausdorff Metric integrates Temporal Coherence Measure into the Bidirectional Mean Hausdorff Metric for class separation by accounting for temporal order. Finally, Prototype-discriminative Contrastive Loss exploits distances from class prototypes to query samples. ESR-MM demonstrates superior performance on two benchmarks.
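To make the metric terminology in this abstract concrete, here is a minimal sketch of a plain bidirectional mean Hausdorff distance between two sequences of per-frame feature vectors, plus the Frobenius norm of a feature matrix. It assumes the bidirectional form is the sum of the two directed means; the temporal coherence term and the exact loss used by ESR-MM are not reproduced, and all names are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def directed_mean_hausdorff(seq_a, seq_b):
    """Average, over frames of seq_a, of the distance to the closest frame of seq_b."""
    return sum(min(euclidean(a, b) for b in seq_b) for a in seq_a) / len(seq_a)


def bidirectional_mean_hausdorff(seq_a, seq_b):
    """Sum of both directed distances (assumed form of the bidirectional metric)."""
    return directed_mean_hausdorff(seq_a, seq_b) + directed_mean_hausdorff(seq_b, seq_a)


def frobenius_norm(matrix):
    """Square root of the sum of squared entries of a feature matrix."""
    return math.sqrt(sum(x * x for row in matrix for x in row))
```

Because each frame is matched to its nearest frame in the other sequence, this plain version ignores frame order entirely, which is the gap a temporal coherence term is meant to close.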
… We propose a Few-shot Learning pipeline for 3D skeleton-based action … , LY, Kot, AC: Ntu rgb+d 120: A largescale benchmark for 3d human activity understanding. IEEE TPAMI (2019) 2…
Skeleton-based human action recognition is promising due to its privacy preservation, robustness to visual challenges, and computational efficiency. In particular, the practical necessity to recognize unseen actions has led to increased interest in zero-shot skeleton-based action recognition (ZSSAR). Existing ZSSAR approaches often rely on manually crafted action descriptions or visual assumptions to enhance knowledge transfer, which limits flexibility and is prone to inaccuracies and noise. To overcome this, we introduce Semantic-guided Cross-Modal Prompt Learning (SCoPLe), a novel framework that replaces manual guidance with data-driven prompt learning for refinement and alignment of skeletal and textual features. Specifically, we introduce a dual-stream language prompting module that preserves the original semantic context from the pre-trained text encoder while still effectively tuning its output for ZSSAR task adaptation. We also introduce a joint-shaped prompting module that learns tuning for skeleton features and incorporate an adaptive visual representation sampler that leverages text semantics to strengthen the cross-modal prompting interactions during skeleton-to-text embedding projection. Experimental results on the NTU-RGB+D and PKU-MMD datasets demonstrate the state-of-the-art performance of our method in both ZSSAR and generalized ZSSAR scenarios.
Human action recognition is an important task in computer vision. Extracting discriminative spatial and temporal features to model the spatial and temporal evolutions of different actions plays a key role in accomplishing this task. In this work, we propose an end-to-end spatial and temporal attention model for human action recognition from skeleton data. We build our model on top of the Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), which learns to selectively focus on discriminative joints of skeleton within each frame of the inputs and pays different levels of attention to the outputs of different frames. Furthermore, to ensure effective training of the network, we propose a regularized cross-entropy loss to drive the model learning process and develop a joint training strategy accordingly. Experimental results demonstrate the effectiveness of the proposed model, both on the small human action recognition dataset of SBU and the currently largest NTU dataset.
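The two attention levels described here (joints within a frame, then frames within a sequence) can be sketched as two rounds of weighted pooling. This is only an illustration: in the paper the scores come from learned LSTM sub-networks, whereas here they are plain inputs, and using softmax normalization at both levels is an assumption. Function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, scores):
    """Weighted sum of feature vectors using softmax-normalized scores."""
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def spatio_temporal_pool(sequence, joint_scores, frame_scores):
    """sequence[t][j] is the feature vector of joint j at frame t.

    Spatial attention pools the joints of each frame; temporal attention
    then pools the resulting per-frame vectors.
    """
    per_frame = [attention_pool(frame, js) for frame, js in zip(sequence, joint_scores)]
    return attention_pool(per_frame, frame_scores)
```

With equal scores the pooling reduces to a plain average, so any gain from attention comes from the scores departing from uniform.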
Human action analytics has attracted a lot of attention for decades in computer vision. It is important to extract discriminative spatio-temporal features to model the spatial and temporal evolutions of different actions. In this paper, we propose a spatial and temporal attention model to explore the spatial and temporal discriminative features for human action recognition and detection from skeleton data. We build our networks based on the recurrent neural networks with long short-term memory units. The learned model is capable of selectively focusing on discriminative joints of skeletons within each input frame and paying different levels of attention to the outputs of different frames. To ensure effective training of the network for action recognition, we propose a regularized cross-entropy loss to drive the learning process and develop a joint training strategy accordingly. Moreover, based on temporal attention, we develop a method to generate the action temporal proposals for action detection. We evaluate the proposed method on the SBU Kinect Interaction data set, the NTU RGB + D data set, and the PKU-MMD data set, respectively. Experiment results demonstrate the effectiveness of our proposed model on both action recognition and action detection.
The human body skeleton, viewed as a spatiotemporal graph, is attracting increasing attention from researchers who adopt graph convolutional networks (GCNs) to mine discriminative features from skeleton joints. However, one flaw of GCNs is their inability to handle long-distance dependencies between joints. In this regard, the graph attention network (GAT) was recently suggested, which combines graph convolutions with a self-attention mechanism to extract the most informative joints of a human skeleton and increase model accuracy. However, GAT can compute only static attention: the attention ranking is the same for every query node, which severely limits the expressivity of the attention mechanism. In this work, we present a spatial-temporal dynamic graph attention network (ST-DGAT) to learn the spatial-temporal patterns of skeleton sequences. For dynamic graph attention, we change the order of the weighted vector operations in GAT; our approach achieves a global approximate attention function, making it strictly superior to GAT. Experiments show that by fixing the order of the internal operations of GAT, the proposed model achieves better action classification results while maintaining the same computing cost as GAT. The proposed framework has been evaluated on the well-known, publicly available large-scale datasets NTU60, NTU120, and Kinetics-400, where it notably outperforms state-of-the-art (SOTA) results with accuracies of 96.4%, 88.2%, and 61.0%, respectively.
… Capturing the dependencies between joints is critical in skeleton-based action recognition. … , a novel spatio-temporal segments attention method is proposed. The skeleton sequence is …
Human skeleton joints captured by RGB-D cameras are widely used in action recognition for their robust and comprehensive 3D information. Presently, most action recognition methods based on skeleton joints treat all skeletal joints with the same importance spatially and temporally. However, the contributions of skeletal joints vary significantly. Hence, a GL-LSTM+Diff model is proposed to improve the recognition of human actions. A global spatial attention (GSA) model is proposed to assign different weights to different skeletal joints and thereby provide precise spatial information for human action recognition. The accumulative learning curve (ALC) model is introduced to highlight which frames contribute most to the final decision by giving varying temporal weights to the intermediate accumulated learning results. By integrating the proposed GSA (for spatial information) and ALC (for temporal processing) models into the LSTM framework and taking the human skeletal joints as inputs, a global spatio-temporal action recognition framework (GL-LSTM) is constructed to recognize human actions. Diff is introduced as the preprocessing method to enhance the dynamics of the features and thus obtain distinguishable features for deep learning. Rigorous experiments on the largest dataset, NTU RGB+D, and the common small dataset, SBU, show that the algorithm proposed in this paper outperforms other state-of-the-art methods.
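The abstract does not spell out what the Diff preprocessing computes. A minimal sketch, assuming it denotes the frame-to-frame difference of joint coordinates (a common way to emphasize motion), could look like this; the function name is hypothetical.

```python
def frame_differences(sequence):
    """Return the change in every coordinate between consecutive frames.

    sequence: list of frames, each a flat list of joint coordinates.
    Assumption: "Diff" means a first-order temporal difference.
    The result has one frame fewer than the input.
    """
    return [
        [curr - prev for curr, prev in zip(sequence[t], sequence[t - 1])]
        for t in range(1, len(sequence))
    ]
```

A joint that stays still produces zeros under this transform, so static pose information is discarded and only movement remains.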
… vary across frames and activities. In this paper, we propose a novel framework for finding temporal and spatial attentions in a cooperative manner for activity recognition. The proposed …
Graph Convolutional Networks (GCNs) have been widely used to model the high-order dynamic dependencies for skeleton-based action recognition. Most existing approaches do not explicitly embed the high-order spatio-temporal importance to joints’ spatial connection topology and intensity, and they do not have direct objectives on their attention module to jointly learn when and where to focus on in the action sequence. To address these problems, we propose the To-a-T Spatio-Temporal Focus (STF), a skeleton-based action recognition framework that utilizes the spatio-temporal gradient to focus on relevant spatio-temporal features. We first propose the STF modules with learnable gradient-enforced and instance-dependent adjacency matrices to model the high-order spatio-temporal dynamics. Second, we propose three loss terms defined on the gradient-based spatio-temporal focus to explicitly guide the classifier when and where to look at, distinguish confusing classes, and optimize the stacked STF modules. STF outperforms the state-of-the-art methods on the NTU RGB+D 60, NTU RGB+D 120, and Kinetics Skeleton 400 datasets in all 15 settings over different views, subjects, setups, and input modalities, and STF also shows better accuracy on scarce data and dataset shifting settings.
Skeleton-based action recognition has recently attracted a lot of attention. Researchers are coming up with new approaches for extracting spatio-temporal relations and making considerable progress on large-scale skeleton-based datasets. Most of the proposed architectures are based on recurrent neural networks (RNNs), convolutional neural networks (CNNs), and graph-based CNNs. For skeleton-based action recognition, long-term contextual information is central, yet it is not captured by current architectures. To better represent and capture long-term spatio-temporal relationships, we propose three variants of the Self-Attention Network (SAN), namely SAN-V1, SAN-V2, and SAN-V3. Our SAN variants have the capability of extracting high-level semantics by capturing long-range correlations. We have also integrated the Temporal Segment Network (TSN) with our SAN variants, which resulted in improved overall performance. Different configurations of SAN variants and TSN are explored with extensive experiments. Our chosen configuration outperforms the state of the art in Top-1 and Top-5 accuracy by 4.4% and 7.9%, respectively, on Kinetics and shows consistently better performance than state-of-the-art methods on NTU RGB+D.
In action recognition, although the combination of spatiotemporal videos and skeleton features can improve the recognition performance, a separate model and balancing feature representation for cross-modal data are required. To solve these problems, we propose Spatio-TemporAl cRoss (STAR)-transformer, which can effectively represent two cross-modal features as a recognizable vector. First, from the input video and skeleton sequence, video frames are output as global grid tokens and skeletons are output as joint map tokens, respectively. These tokens are then aggregated into multi-class tokens and input into STAR-transformer. The STAR-transformer encoder consists of a full spatio-temporal attention (FAttn) module and a proposed zigzag spatio-temporal attention (ZAttn) module. Similarly, the continuous decoder consists of a FAttn module and a proposed binary spatio-temporal attention (BAttn) module. STAR-transformer learns an efficient multi-feature representation of the spatio-temporal features by properly arranging pairings of the FAttn, ZAttn, and BAttn modules. Experimental results on the Penn-Action, NTU-RGB+D 60, and 120 datasets show that the proposed method achieves a promising improvement in performance in comparison to previous state-of-the-art methods.
Skeleton-based action recognition is an important task that requires the adequate understanding of movement characteristics of a human action from the given skeleton sequence. Recent studies have shown that exploring spatial and temporal features of the skeleton sequence is vital for this task. Nevertheless, how to effectively extract discriminative spatial and temporal features is still a challenging problem. In this paper, we propose a novel Attention Enhanced Graph Convolutional LSTM Network (AGC-LSTM) for human action recognition from skeleton data. The proposed AGC-LSTM can not only capture discriminative features in spatial configuration and temporal dynamics but also explore the co-occurrence relationship between spatial and temporal domains. We also present a temporal hierarchical architecture to increase temporal receptive fields of the top AGC-LSTM layer, which boosts the ability to learn the high-level semantic representation and significantly reduces the computation cost. Furthermore, to select discriminative spatial information, the attention mechanism is employed to enhance information of key joints in each AGC-LSTM layer. Experimental results on two datasets are provided: NTU RGB+D dataset and Northwestern-UCLA dataset. The comparison results demonstrate the effectiveness of our approach and show that our approach outperforms the state-of-the-art methods on both datasets.
Skeleton-based action recognition has attracted significant attention and obtained widespread applications due to the robustness of 3D skeleton data. One of the key challenges is how to extract discriminative and robust spatio-temporal features from sparse skeleton data to describe actions and improve recognition accuracy. To address this issue, this paper combines convolutions with attention mechanisms and proposes a deep network for skeleton-based action recognition, termed as local-and-global attention network (LAGA-Net). First, we encode skeleton sequences into joint feature evolution maps to compactly describe the spatial and temporal characteristics of skeleton sequences. Then, a motion guided channel attention module (MGCAM) is proposed to model the interdependencies between feature channels by calculating temporal frame-level motion and enhance motion-salient features in a channel-wise way. Further, a spatio-temporal attention module (STAM) is proposed to model spatio-temporal context-aware collaboration at sequence level and extract spatio-temporal attention features that involve long-range dependencies. Together, MGCAM and STAM are combined to form LAGA-Net, which extracts discriminative features integrating both local and global representations of skeleton sequences. Moreover, a two-stream architecture is proposed to learn complementary features from joint and bone aspects. We conduct extensive experiments to verify the effectiveness and superiority of our proposed method over state-of-the-art approaches on several benchmarks (e.g., NTU RGB+D, Northwestern-UCLA, UTD-MHAD and NTU RGB+D 120).
Skeleton-based human action recognition has broad application prospects in the field of virtual reality, as skeleton data is more resistant to data noise such as background interference and camera angle changes. Notably, recent works treat the human skeleton as a non-grid representation, e.g., a skeleton graph, and then learn the spatio-temporal pattern via graph convolution operators. Still, stacked graph convolutions play only a marginal role in modeling long-range dependencies that may contain crucial action semantic cues. In this work, we introduce a skeleton large kernel attention operator (SLKA), which can enlarge the receptive field and improve channel adaptability without adding much computational burden. Then a spatiotemporal SLKA module (ST-SLKA) is integrated, which can aggregate long-range spatial features and learn long-distance temporal correlations. Further, we have designed a novel skeleton-based action recognition network architecture called the spatiotemporal large-kernel attention graph convolution network (LKA-GCN). In addition, large-movement frames may carry significant action information. This work proposes a joint movement modeling strategy (JMM) to focus on valuable temporal interactions. Ultimately, on the NTU-RGBD 60, NTU-RGBD 120 and Kinetics-Skeleton 400 action datasets, our LKA-GCN achieves state-of-the-art performance.
… Abstract: Aiming at the problem that the existing human skeleton-based action recognition … the temporal and spatial characteristics of motion, a human skeleton-based action recognition …
… a skeleton action recognition model based on spatial–temporal graph attention networks (ST-… 1, at first, we construct each skeleton sequence into a spatiotemporal graph employing the …
… body is limited, the extra cost of applying self-attention mechanism is also relatively small. … spatial-temporal attention networks (DSTA-Net) for skeleton-based action recognition. It is …
Human action recognition has attracted considerable research attention in the field of computer vision, especially for classroom environments. However, most relevant studies have focused on one specific behavior of students. Therefore, this paper proposes a student behavior recognition system based on skeleton pose estimation and person detection. First, consecutive frames captured with a classroom camera were used as the input images of the proposed system. Then, skeleton data were collected using the OpenPose framework. An error correction scheme was proposed based on the pose estimation and person detection techniques to decrease incorrect connections in the skeleton data. The preprocessed skeleton data were subsequently used to eliminate several joints that had a weak effect on behavior classification. Second, feature extraction was performed to generate feature vectors that represent human postures. The adopted features included normalized joint locations, joint distances, and bone angles. Finally, behavior classification was conducted to recognize student behaviors. A deep neural network was constructed to classify actions, and the proposed system was able to identify the number of students in a classroom. Moreover, a system prototype was implemented to verify the feasibility of the proposed system. The experimental results indicated that the proposed scheme outperformed the skeleton-based scheme in complex situations. The proposed system had a 15.15% higher average precision and 12.15% higher average recall than the skeleton-based scheme did.
To address the shortage of professional instructors and the outdated teaching methods in college and university tennis instruction, this paper proposes a tennis action recognition and evaluation method based on skeleton keypoints. A YOLOv3 object detector is combined with a ResNet-50 pose estimation model to extract human skeleton keypoints, and a coordinate system reconstruction method is proposed in which the origin is moved to the midpoint between the left and right shoulders, effectively resolving the problem of the human body's relative position in the image. The tennis action dataset is refined into a fine-grained dataset by angle calculation, and the start and end points of each action are delimited so as to eliminate redundant and invalid frames. Then three skeleton-keypoint-based action recognition models, ST-GCN, AGCN, and PoseC3D, are trained; the experimental results show that with the fine-grained dataset, the accuracy of the PoseC3D model increases from 69.17% to 88.33%. Finally, a multi-dimensional dynamic time warping evaluation algorithm based on the important keypoints and the center of gravity of the human body is proposed, and an independent-samples t-test verifies that there is no significant difference between the algorithm's scores and the coach's scores (p = 0.619 > 0.05). This study provides a theoretical basis and technical support for the development of intelligent tennis teaching systems.
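Two steps in this abstract lend themselves to a short sketch: moving the coordinate origin to the shoulder midpoint, and comparing two keypoint sequences with dynamic time warping (DTW). The code below is an illustration under stated assumptions: keypoints follow the COCO 17-point order (left shoulder at index 5, right shoulder at index 6), which the abstract does not specify, and the DTW shown is the textbook algorithm rather than the paper's weighted multi-dimensional variant. All names are hypothetical.

```python
import math

# Assumed COCO-17 keypoint indices; the paper's own layout may differ.
LEFT_SHOULDER, RIGHT_SHOULDER = 5, 6


def recenter_on_shoulders(keypoints):
    """Translate (x, y) keypoints so the shoulder midpoint becomes the origin."""
    (lx, ly), (rx, ry) = keypoints[LEFT_SHOULDER], keypoints[RIGHT_SHOULDER]
    ox, oy = (lx + rx) / 2.0, (ly + ry) / 2.0
    return [(x - ox, y - oy) for x, y in keypoints]


def frame_distance(frame_a, frame_b):
    """Sum of Euclidean distances between corresponding keypoints."""
    return sum(math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in zip(frame_a, frame_b))


def dtw_distance(seq_a, seq_b, dist=frame_distance):
    """Classic DTW cost between two sequences of frames."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Recentering removes the effect of where the player stands in the frame, and DTW tolerates a student performing the same stroke faster or slower than the reference.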
… keypoint tracking, human detection, gaze direction estimation, and even advanced recognition of behavior … in attention measurement but also gives pose-based analysis, particularly in …
Security is a critical function for protecting target individuals from potential threats. In particular, safeguarding officials such as politicians requires significant resources to ensure their safety. In this paper, we investigate a real-time video surveillance system for safeguarding and develop an accurate and efficient human behavior detection method. After defining four target behaviors associated with potential threats, we construct a video dataset for these behaviors. Using sequences of estimated human poses, we then implement a lightweight human behavior detection method. Specifically, our approach combines convolutional layers with a Transformer encoder to capture both local and global features of human behavior. Experimental results demonstrate that our network achieves an average accuracy of 96.84% with an inference time of 2.1 milliseconds. We expect that the proposed method will significantly reduce the operational cost of video surveillance while maintaining effective detection of potential security threats.
Evaluating the quality of classroom teaching is crucial for modern education. Because they rely on subjective assessment and qualitative analysis, traditional teaching evaluation methods often struggle to provide comprehensive and objective data support. In this study, we propose an innovative model named PoseTeach, which is designed to automatically and quantitatively recognize teaching actions in the classroom from a video stream. PoseTeach first detects the region where the teacher is located using Faster R-CNN, then estimates the pose with HRNet, and finally recognizes the teacher's actions. The model leverages PoseC3D as the backbone network and integrates a Spatiotemporal Attention Mechanism (STAM) and a Dilated Convolution Module (DCM) to better capture the spatial and temporal information of teaching actions. A self-constructed dataset, TeachMove, which contains video clips of various classroom teaching behaviors, was used to train and test our model. On this dataset, PoseTeach achieved a Top-1 accuracy of 67.9% and an average accuracy (acc/mean1) of 62.83%, outperforming the benchmark model's 60% in both metrics. These experimental results also demonstrate that PoseTeach performs well across multiple evaluation metrics, surpassing traditional models such as ST-GCN, 2s-AGCN, and STGCN++.
Distracted driving is one of the primary causes of road traffic accidents. Behavior recognition technology based on machine vision has emerged as a research hotspot due to its non-contact and high-efficiency nature. To address the challenges of complex lighting conditions in the driver’s cabin, low detection accuracy for small-scale keypoints, and the difficulty in effectively characterizing behavioral features, this paper proposes a distracted driving behavior recognition method based on an improved YOLOv8n-Pose model and multi-feature fusion. First, the original YOLOv8n-Pose model is optimized. A P2 detection layer is added to enhance the feature extraction capabilities for small-scale human keypoints, and the SE attention module is incorporated to improve the model’s robustness under complex lighting conditions. In addition, the loss function is replaced with focal loss to tackle the class imbalance problem, thus forming the YOLOv8n-PSF-Pose keypoint detection network. Subsequently, based on the coordinates of 12 human keypoints extracted by this network, a multi-dimensional feature vector is constructed, which takes joint angles as the core and integrates the relative distances between keypoints and the number of valid keypoints. Finally, a BP neural network is adopted to classify the constructed feature vectors, enabling the accurate recognition of six typical distracted driving behaviors (normal driving, drinking or eating, making phone calls, using mobile phones, operating vehicle infotainment systems, and turning around to fetch items). 
The experimental results show that the improved YOLOv8n-PSF-Pose model achieves an mAP50 of 93.8% in keypoint detection, which is 6.7 percentage points higher than the original model; the BP classification model based on multi-feature fusion achieves an F1-score of 97.7% in the behavior recognition task, which is significantly better than traditional classifiers such as SVM and random forest; and the image processing speed on an NVIDIA RTX 3090 Ti reaches 45 FPS. This indicates that the proposed method achieves a good balance between accuracy and speed. This study provides an effective solution for the real-time and accurate recognition of distracted driving behaviors.
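The feature vector described in this abstract (joint angles, relative keypoint distances, and the number of valid keypoints) can be sketched as follows. Which joints form each angle or distance, and the confidence threshold for a "valid" keypoint, are not given in the abstract; the `angle_triplets`, `distance_pairs`, and `min_conf` parameters below are hypothetical placeholders.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at point b between segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0.0 or n2 == 0.0:
        return 0.0  # Degenerate case: two of the points coincide.
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def build_features(keypoints, angle_triplets, distance_pairs, min_conf=0.5):
    """Concatenate angles, distances, and the count of confident keypoints.

    keypoints: list of (x, y, confidence) tuples.
    angle_triplets: (i, j, k) index triples; the angle is measured at j.
    distance_pairs: (i, j) index pairs.
    """
    angles = [joint_angle(keypoints[i], keypoints[j], keypoints[k])
              for i, j, k in angle_triplets]
    dists = [math.hypot(keypoints[i][0] - keypoints[j][0],
                        keypoints[i][1] - keypoints[j][1])
             for i, j in distance_pairs]
    valid = sum(1 for kp in keypoints if kp[2] >= min_conf)
    return angles + dists + [float(valid)]
```

For example, a shoulder-elbow-wrist triplet yields the elbow angle, which separates a raised phone from a hand resting on the wheel. In practice the distances would usually be normalized by a body-scale reference such as shoulder width, which this sketch omits.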
… A summary of datasets for behavior recognition is shown in Table 1. Databases play at least … PA3D and I3D are highly complementary and superior to many other pose-based methods. …
This work targets human action recognition in video. While recent methods typically represent actions by statistics of local video features, here we argue for the importance of a representation derived from human pose. To this end we propose a new Pose-based Convolutional Neural Network descriptor (P-CNN) for action recognition. The descriptor aggregates motion and appearance information along tracks of human body parts. We investigate different schemes of temporal aggregation and experiment with P-CNN features obtained both for automatically estimated and manually annotated human poses. We evaluate our method on the recent and challenging JHMDB and MPII Cooking datasets. For both datasets our method shows consistent improvement over the state of the art.
In the fields of neuroscience and pharmacology, understanding rodent behavior is of vital importance for studying the effects of genetic operations and pharmacological therapies. Conventional behavior recognition methods based on raw images often struggle with noise, such as changes in the lighting conditions and the image backgrounds. On the other hand, pose-based approaches have demonstrated robustness against these challenges. However, existing methods rely on manually constructed features, which are time-consuming and may not fully exploit the potential of the pose data. In this work, we propose the hierarchical spatial–temporal window transformer network (HSTWFormer), a novel approach that efficiently extracts multiscale and cross-spacetime features from rodent pose data. By adopting a pure Transformer structure, HSTWFormer not only avoids the need for a predefined skeletal topology, but also enables adaptive recognition of interactive behaviors between multiple rodents. By merging the features of temporal neighbors, we construct a hierarchical structure with different receptive fields that retain essential information of all scales, enabling the extraction of semantic features from low to high level. Furthermore, a spatial–temporal window attention (STWA) block is introduced to capture correlations between different key points across frames. The STWA blocks facilitate the extraction of both short-term and long-term cross-spacetime features by enabling interactions between window information through window shifting, enhancing the network’s modeling performance. The effectiveness of the proposed HSTWFormer is demonstrated on two datasets, CRIM13 and CalMS21. We achieved accuracies of 79.3% and 69.8% for interactive and overall behaviors in the CRIM13 dataset, and 76.4% accuracy in the CalMS21 dataset. 
Our method harnesses the wealth of information embedded in key points, showcasing robust modeling capabilities for accurate rodent behavior recognition, and provides a novel and effective approach to assist researchers in neuroscience and pharmacology in better quantifying rodent behavior.
… lightweight action recognition network, namely MCANet. The network reduces the number of parameters in traditional transformers by improving the attention … of human action, with a …
While Graph Convolutional Networks (GCNs) have revolutionized skeleton-based action recognition, existing methods face a critical efficiency–accuracy dilemma: state-of-the-art approaches achieve high performance through computationally expensive multi-stream fusion (joint, bone, joint motion, and bone motion) and deep architectures, limiting real-world deployment on resource-constrained devices. We propose LST-AGCN (Lightweight Spatial–Temporal Attention Graph Convolutional Network), introducing three technical contributions that address this challenge: (1) Unified Attention Module (UAM): a framework that integrates channel, spatial, and temporal attention through a single compact operation, significantly reducing attention parameters compared to separate attention mechanisms; (2) Depthwise Separable Attention Mechanism (DSAM): a factorization using depthwise separable convolutions that achieves linear complexity reduction from O(C²) to O(C) in attention operations; and (3) Efficient Topology-Aware Fusion (ETAF): an adaptive Joint-wise Attention strategy that captures fine-grained spatial relationships without quadratic complexity growth. Extensive experiments on NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate that LST-AGCN achieves strong performance using only joint modality (86.14%/94.0% and 79.5%/82.0% Top-1 accuracy with 99.0% Top-5 on cross-view) while requiring 14.11 M parameters and 19.02 GFLOPs, delivering efficient inference suitable for edge deployment.
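The O(C²) versus O(C) scaling mentioned in this abstract can be illustrated with a simple weight count. The sketch below is not the DSAM module itself, whose layout the abstract does not give. It only compares a dense channel-mixing layer (C × C weights) with a depthwise 1D convolution (one kernel of assumed size 3 per channel). The function names are hypothetical. Note that a full depthwise separable convolution usually also has a pointwise 1×1 step whose weight count is again quadratic in C; how DSAM handles that step is not stated here.

```python
def dense_mixing_params(channels: int) -> int:
    """Weights of a dense layer mapping C channels to C channels: C * C."""
    return channels * channels


def depthwise_params(channels: int, kernel_size: int = 3) -> int:
    """Weights of a depthwise 1D convolution: one kernel per channel, C * k."""
    return channels * kernel_size


if __name__ == "__main__":
    for c in (64, 128, 256):
        print(f"C={c:4d}  dense={dense_mixing_params(c):6d}  depthwise={depthwise_params(c):4d}")
```

Doubling the channel count quadruples the dense count but only doubles the depthwise count, which is the gap the abstract refers to.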
To address the excessive computational cost of transformer-based human action recognition models, this paper proposes a lightweight MotionBERT-based approach called light-MB. First, we replace spatiotemporal attention modules with Focused Gating Attention Units (F-GAU) to reduce computational complexity while preserving performance. Next, we remove redundant multi-head attention in deeper layers and use a mixed approximate attention module to extract local and global features, with inter-channel links added to improve information exchange. Finally, Focal Loss is substituted for cross-entropy loss. Extensive experiments show that light-MB achieves a 0.4% accuracy improvement on NTU RGB+D 120 (one-shot). The parameter size is only 3.67% of the baseline, and FLOPs are reduced to 4.2%.
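The swap from cross-entropy to Focal Loss can be illustrated for a single sample. This is a minimal sketch of the standard focal loss formula, FL = -α (1 - p)^γ log p, where p is the predicted probability of the true class. The values γ = 2 and α = 1 are assumptions for illustration; the abstract does not state the settings used in light-MB.

```python
import math


def cross_entropy(p_true: float) -> float:
    """Cross-entropy for one sample, given the predicted probability of the true class."""
    return -math.log(p_true)


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by alpha * (1 - p_true) ** gamma."""
    return alpha * (1.0 - p_true) ** gamma * cross_entropy(p_true)


if __name__ == "__main__":
    # Confident (easy) samples are down-weighted far more than hard ones.
    for p in (0.9, 0.5, 0.1):
        print(f"p={p:.1f}  CE={cross_entropy(p):.4f}  FL={focal_loss(p):.4f}")
```

With γ = 0 the expression reduces to plain cross-entropy, so γ controls how strongly easy samples are suppressed.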
The rise of online physical education in higher education has improved accessibility but presents challenges in recognizing complex movements and delivering individualized feedback. Existing action recognition models are often computationally intensive and struggle to generalize across diverse skeletal patterns. To address this, we propose a lightweight graph convolutional network (GCN) that integrates an improved Ghost module with multi-attention mechanisms, including a global attention mechanism (GAM) and a channel attention mechanism (CAM), to enhance spatial and temporal feature extraction. The model is trained end-to-end on 3D skeleton sequences and optimized for real-time efficiency. The computational cost is evaluated in terms of giga floating-point operations (GFLOPs), with the proposed model requiring only 6.2 GFLOPs per inference, over 60% less than the baseline ST-GCN. Experimental results on the NTU RGB+D 60 dataset demonstrate that the model achieves 90.8% accuracy in cross-subject and 96.8% in cross-view settings. These findings highlight the model’s effectiveness in balancing accuracy and efficiency, with promising applications in online physical education, rehabilitation monitoring, elderly movement analysis, and VR-based interfaces.
… pattern recognition. Nowadays, artificial intelligence (AI) based systems are needed for human-behavior assessment and security purposes. The existing action recognition techniques …
… attention module to assign different weights to each frame. We train the temporal attention … frames and constructs a lightweight network for action recognition. Experimental results …
Skeleton-based action recognition can achieve a relatively high performance by transforming the human skeleton structure in an image into a graph and applying action recognition based on structural changes in the body. Among the many graph convolutional network (GCN) approaches used in skeleton-based action recognition, semantic-guided neural networks (SGNs) are fast action recognition algorithms that hierarchically learn spatial and temporal features by applying a GCN. However, because an SGN focuses on global feature learning rather than local feature learning owing to the structural characteristics, there is a limit to an action recognition in which the dependency between neighbouring nodes is important. To solve these problems and simultaneously achieve a real-time action recognition in low-end devices, in this study, a single head attention (SHA) that can overcome the limitations of an SGN is proposed, and a new SGN-SHA model that combines SHA with an SGN is presented. In experiments on various action recognition benchmark datasets, the proposed SGN-SHA model significantly reduced the computational complexity while exhibiting a performance similar to that of an existing SGN and other state-of-the-art methods.
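The idea of turning a skeleton into a graph and passing features between neighbouring joints can be sketched as follows. This is a generic illustration, not the SGN or SGN-SHA architecture. The 5-joint skeleton and its bone list are hypothetical, and the propagation step uses the common symmetric normalization D^(-1/2) (A + I) D^(-1/2) without any learned weights.

```python
import math

# Hypothetical 5-joint skeleton: 0=head, 1=neck, 2=left hand, 3=right hand, 4=hip.
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]


def normalized_adjacency(num_joints, bones):
    """Return D^(-1/2) (A + I) D^(-1/2) as a nested list."""
    a = [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]
    for i, j in bones:
        a[i][j] = a[j][i] = 1.0
    degree = [sum(row) for row in a]
    return [
        [a[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(num_joints)]
        for i in range(num_joints)
    ]


def propagate(adjacency, features):
    """One neighbourhood-aggregation step; features[j] is the feature vector of joint j."""
    n, dim = len(adjacency), len(features[0])
    return [
        [sum(adjacency[i][k] * features[k][c] for k in range(n)) for c in range(dim)]
        for i in range(n)
    ]


if __name__ == "__main__":
    adj = normalized_adjacency(5, BONES)
    # Toy 2D joint coordinates for a single frame.
    joints = [[0.0, 1.8], [0.0, 1.5], [-0.6, 1.0], [0.6, 1.0], [0.0, 0.9]]
    for row in propagate(adj, joints):
        print([round(v, 3) for v in row])
```

A real GCN layer would multiply the aggregated features by a learned weight matrix and apply a nonlinearity; those parts are omitted here.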
… For the problems of irrelevant frames and high model complexity in action recognition, we … Dual-Stage Attention Network (STHG-DAN) for multi-view data lightweight action recognition. It …
… lightweight human activity recognition model that integrates high-efficiency feature space and temporal information fusion with an attention … -known public Human Action Recognition (…
In-context learning provides a new perspective for multi-task modeling for vision and NLP. Under this setting, the model can perceive tasks from prompts and accomplish them without any extra task-specific head predictions or model fine-tuning. However, skeleton sequence modeling via in-context learning remains unexplored. Directly applying existing in-context models from other areas onto skeleton sequences fails due to the similarity between inter-frame and cross-task poses, which makes it exceptionally hard to perceive the task correctly from a subtle context. To address this challenge, we propose Skeleton-in-Context (SiC), an effective framework for in-context skeleton sequence modeling. Our SiC is able to handle multiple skeleton-based tasks simultaneously after a single training process and accomplish each task from context according to the given prompt. It can further generalize to new, unseen tasks according to customized prompts. To facilitate context perception, we additionally propose a task-unified prompt, which adaptively learns tasks of different natures, such as partial joint-level generation, sequence-level prediction, or 2D-to-3D motion prediction. We conduct extensive experiments to evaluate the effectiveness of our SiC on multiple tasks, including motion prediction, pose estimation, joint completion, and future pose estimation. We also evaluate its generalization capability on unseen tasks such as motion-in-between. These experiments show that our model achieves state-of-the-art multi-task performance and even outperforms single-task methods on certain tasks.
Using neural networks for learning motion controllers from motion capture data is becoming popular due to the natural and smooth motions they can produce, the wide range of movements they can learn and their compactness once they are trained. Despite these advantages, these systems require large amounts of motion capture data for each new character or style of motion to be generated, and systems have to undergo lengthy retraining, and often reengineering, to get acceptable results. This can make the use of these systems impractical for animators and designers and solving this issue is an open and rather unexplored problem in computer graphics. In this paper we propose a transfer learning approach for adapting a learned neural network to characters that move in different styles from those on which the original neural network is trained. Given a pretrained character controller in the form of a Phase‐Functioned Neural Network for locomotion, our system can quickly adapt the locomotion to novel styles using only a short motion clip as an example. We introduce a canonical polyadic tensor decomposition to reduce the amount of parameters required for learning from each new style, which both reduces the memory burden at runtime and facilitates learning from smaller quantities of data. We show that our system is suitable for learning stylized motions with few clips of motion data and synthesizing smooth motions in real‐time.
Few-shot action recognition aims to train a model to classify actions in videos using only a few examples, known as “shots,” per action class. This learning approach is particularly useful but challenging due to the limited availability of labeled video data in practice. Although significant progress has been made in developing few-shot learners, existing methods still face several limitations. Firstly, current methods have not sufficiently explored the effectiveness of 3D feature extractors (e.g., 3D CNNs or Video Transformers), thereby failing to exploit spatiotemporal dynamics in videos. Secondly, the need for a large video dataset to train the model in a centralized manner raises privacy concerns and results in high storage costs and communication overheads. Thirdly, the existing solutions based on local deployment lack the capability to benefit global prior knowledge from a wide variety of real-world action samples. To address these limitations, we propose a federated learning (FL) framework named FedFSLAR++ to collaboratively train few-shot learners with 3D feature extractors. Specifically, we perform few-shot action recognition tasks under FL settings, enhancing privacy protection while maintaining efficient communication and storage. Moreover, FL allows us to effectively learn meta-knowledge from a large set of action videos among heterogeneous clients. Within our framework, we establish a unified benchmark to systematically and fairly compare different components, including feature extraction, meta-learning, and FL for model update and aggregation. This type of benchmark is still lacking in the literature. Notably, we thoroughly examine six 3D CNN and Transformer models for extracting spatiotemporal video features needed to adapt to new tasks quickly during the meta-learning process. We further propose a hybrid feature extractor that combines the advantages of 3D CNNs and Transformers to produce strong video representations. 
Additionally, we explore three meta-learning paradigms and three FL algorithms to investigate their effectiveness and suggest the optimal choices for performance improvement. Results from extensive experiments on four action datasets verify the robustness of the FedFSLAR++ framework. Our comprehensive study provides a solid foundation for future research advancements in action recognition.
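The N-way K-shot setting described in this abstract can be made concrete with a nearest-prototype classifier, one common few-shot baseline in the style of prototypical networks. This sketch is not the FedFSLAR++ method: the abstract does not name its three meta-learning paradigms, and the 2D "embeddings" and class labels below are made up for illustration. In practice the vectors would come from a feature extractor such as a 3D CNN.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def build_prototypes(support):
    """Map each class label to the mean of its K support embeddings."""
    return {label: mean_vector(examples) for label, examples in support.items()}


def classify(query, prototypes):
    """Assign the query to the class whose prototype is nearest in Euclidean distance."""
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


if __name__ == "__main__":
    # 2-way 2-shot episode with hypothetical embeddings.
    support = {
        "wave": [[1.0, 0.0], [0.8, 0.2]],
        "fall": [[0.0, 1.0], [0.1, 0.9]],
    }
    prototypes = build_prototypes(support)
    print(classify([0.7, 0.3], prototypes))
```

No class-specific parameters are trained for the new classes here; only the support examples change from episode to episode.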
Self-supervised learning (SSL), which aims to learn meaningful prior representations from unlabeled data, has been proven effective for skeleton-based action understanding. Different from the image domain, skeleton data possesses sparser spatial structures and diverse representation forms, with the absence of background clues and the additional temporal dimension, presenting new challenges for spatial-temporal motion pretext task design. Recently, many endeavors have been made for skeleton-based SSL, achieving remarkable progress. However, a systematic and thorough review is still lacking. In this paper, we conduct, for the first time, a comprehensive survey on self-supervised skeleton-based action representation learning. Following the taxonomy of context-based, generative learning, and contrastive learning approaches, we make a thorough review and benchmark of existing works and shed light on the future possible directions. Remarkably, our investigation demonstrates that most SSL works rely on the single paradigm, learning representations of a single level, and are evaluated on the action recognition task solely, which leaves the generalization power of skeleton SSL models under-explored. To this end, a novel and effective SSL method for skeleton is further proposed, which integrates versatile representation learning objectives of different granularity, substantially boosting the generalization capacity for multiple skeleton downstream tasks. Extensive experiments under three large-scale datasets demonstrate our method achieves superior generalization performance on various downstream tasks, including recognition, retrieval, detection, and few-shot learning.
Human action recognition from video sequences is one of the most challenging computer vision applications, primarily owing to intrinsic variations in lighting, pose, occlusions, and other factors. The human skeleton joints extracted by the depth camera Kinect have the advantages of simplified structures and rich contents, and are therefore widely used for capturing human actions. However, at present, most skeletal joint and deep learning-based action recognition methods treat all skeletal joints equally in both spatial and temporal dimensions. Logically, this is not in accordance with the fact that for different human actions the contributions from skeletal joints could significantly vary spatially and temporally. Incorporating information pertaining to such natural variations will certainly aid in designing a robust human action recognition system. Hence, in this work, we endeavor to propose a global spatial attention (GSA) model to suitably express the different skeletal joints with different weights so as to provide precise spatial information for human action recognition. Further, we introduce the notion of an accumulative learning curve (ALC) model that can highlight which frames contribute most to the final decision by giving varying temporal weights to each intermediate accumulated learning result provided by an LSTM upon input frames. The proposed GSA (for spatial information) and ALC (for temporal processing) models are integrated into the LSTM framework to construct a robust action recognition framework that takes the human skeletal joints as input and predicts the human action using the enhanced spatial-temporal attention model. Rigorous experiments on NTU datasets (by far the largest benchmark RGB-D dataset) show that the proposed framework offers the best performance accuracy, least algorithmic complexity and training overheads, when compared with other state-of-the-art human action recognition models.
… problems in Euclidean space, and is not applicable in manifold space. To investigate these … a spatial and temporal attention mechanism on Lie groups for 3D human action recognition. …
Human pose estimation aims to locate the human body parts and build human body representation (e.g., body skeleton) from input data such as images and videos. It has drawn increasing attention during the past decade and has been utilized in a wide range of applications including human-computer interaction, motion analysis, augmented reality, and virtual reality. Although the recently developed deep learning-based solutions have achieved high performance in human pose estimation, there still remain challenges due to insufficient training data, depth ambiguities, and occlusion. The goal of this survey article is to provide a comprehensive review of recent deep learning-based solutions for both 2D and 3D pose estimation via a systematic analysis and comparison of these solutions based on their input data and inference procedures. More than 260 research papers since 2014 are covered in this survey. Furthermore, 2D and 3D human pose estimation datasets and evaluation metrics are included. Quantitative performance comparisons of the reviewed methods on popular datasets are summarized and discussed. Finally, the challenges involved, applications, and future research directions are concluded. A regularly updated project page is provided: https://github.com/zczcwh/DL-HPE.
… The field of human pose estimation has experienced significant advances with the … In the influential paper of [32] the pose estimation problem is turned into a body part classification …
Most video-based action recognition approaches choose to extract features from the whole video to recognize actions. The cluttered background and non-action motions limit the performances of these methods, since they lack the explicit modeling of human body movements. With recent advances of human pose estimation, this work presents a novel method to recognize human action as the evolution of pose estimation maps. Instead of relying on the inaccurate human poses estimated from videos, we observe that pose estimation maps, the byproduct of pose estimation, preserve richer cues of human body to benefit action recognition. Specifically, the evolution of pose estimation maps can be decomposed as an evolution of heatmaps, e.g., probabilistic maps, and an evolution of estimated 2D human poses, which denote the changes of body shape and body pose, respectively. Considering the sparse property of heatmap, we develop spatial rank pooling to aggregate the evolution of heatmaps as a body shape evolution image. As body shape evolution image does not differentiate body parts, we design body guided sampling to aggregate the evolution of poses as a body pose evolution image. The complementary properties between both types of images are explored by deep convolutional neural networks to predict action label. Experiments on NTU RGB+D, UTD-MHAD and PennAction datasets verify the effectiveness of our method, which outperforms most state-of-the-art methods.
Human Pose Estimation (HPE) is the task that aims to predict the location of human joints from images and videos. This task is used in many applications, such as sports analysis and surveillance systems. Recently, several studies have embraced deep learning to enhance the performance of HPE tasks. However, building an efficient HPE model is difficult; many challenges, like crowded scenes and occlusion, must be handled. This paper followed a systematic procedure to review different HPE models comprehensively. About 100 articles published since 2014 on HPE using deep learning were selected using several selection criteria. Both image and video data types of methods were investigated. Furthermore, both single and multiple HPE methods were reviewed. In addition, the available datasets, different loss functions used in HPE, and pretrained feature extraction models were all covered. Our analysis revealed that Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are the most used in HPE. Moreover, occlusion and crowd scenes remain the main problems affecting models’ performance. Therefore, the paper presented various solutions to address these issues. Finally, this paper highlighted the potential opportunities for future work in this task.
… human pose estimation, which aims to recognize the regular human activity from the video sequences. … Human pose relative distance and divergence measure are calculated between …
Human pose estimation localizes body keypoints in order to accurately recognize the postures of individuals in an image. This step is a crucial prerequisite for many computer vision tasks, including human action recognition, human tracking, human-computer interaction, gaming, sign language, and video surveillance. We therefore present this survey to fill the knowledge gap and shed light on research in 2D human pose estimation. After a brief introduction, methods are classified as single- or multi-person pose estimation according to the number of people to be tracked. The approaches used in human pose estimation are then described, followed by some applications and the open problems the field faces. Next, the survey briefly discusses studies with a significant effect on human pose estimation and examines the novelty, motivation, architecture, and working principles of each model, together with its practical applications and drawbacks, the datasets used, and the evaluation metrics. This review is intended as a baseline for newcomers and as a guide for researchers looking for new models by observing the procedural and architectural flaws of existing work.
… field of image classification, human pose detection, action recognition, etc. Ridge regression is another classification algorithm that showcases the different actions in a video sequence. …
… for the classification and interpretation of human motion … human pose estimation and time series classification approaches. The data capture with video, combined with pose estimation …
This paper suggests that human pose estimation (HPE) and sustainable event classification (SEC) require an advanced human skeleton and context-aware feature extraction approach along with machine learning classification methods to recognize daily events precisely. Over the last few decades, researchers have found new mechanisms to make HPE and SEC applicable in daily human life-log events such as sports, surveillance systems, human monitoring systems, and in the education sector. In this research article, we propose a novel HPE and SEC system for which we designed a pseudo-2D stick model. To extract full-body human silhouette features, we proposed various features such as energy, sine, distinct body parts movements, and a 3D Cartesian view of smoothing gradients features. Features extracted to represent human key posture points include rich 2D appearance, angular point, and multi-point autocorrelation. After the extraction of key points, we applied a hierarchical classification and optimization model via ray optimization and a K-ary tree hashing algorithm over the UCF50, HMDB51, and Olympic Sports datasets. Human body key point detection accuracy was 80.9% on UCF50, 82.1% on HMDB51, and 81.7% on Olympic Sports. Event classification accuracy was 90.48% on UCF50, 89.21% on HMDB51, and 90.83% on Olympic Sports. These results indicate better performance for our approach compared to other state-of-the-art methods.
… learning-based 2D/3D human posture estimation strategies from images or video recordings. … more different problem axes on which human pose estimation models can be classified- …
This paper presents an advanced approach to Human Pose Estimation (HPE) and Semantic Event Classification (SEC), emphasizing the need for sophisticated human skeleton models, context-aware feature extraction, and machine learning techniques for precise event recognition in daily life logs. HPE, crucial in applications like sports analysis and surveillance systems, involves predicting human joint locations from images and videos. Recent deep learning advancements have significantly improved HPE, particularly in crowded scenes and occlusion challenges. Despite many surveys, a comprehensive review of HPE, especially with recent deep learning innovations, is still needed. Our research addresses this by proposing a novel HPE and SEC system. The system begins with preprocessing steps, including converting videos into image sequences, applying sliding window techniques, and converting images to grayscale, then extracting human silhouettes using binary masks. We use the GrabCut algorithm for human detection and perform skeletonization with Hough transform algorithm. Keypoint detection is achieved through pose estimation, and full-body feature extraction includes using OpenPose for movable body parts, the Lucas-Kanade method for a 3D Cartesian view, and Texton Map techniques. Key point features are further characterized using motion histograms, pose landmark visualization and Local Intensity Order Pattern (LIOP) features. The system is optimized with adaptive moment estimations and classified using the XGBoost Classifier. Evaluation on the COCO, UCF50, and YouTube datasets showed classification accuracies of 92.90%, 90.9%, and 91.2%, respectively, demonstrating our approach’s superior performance and effectiveness compared to existing state-of-the-art techniques.
In recent years, population aging and the empty-nest problem have become increasingly severe. In addition, falls are the leading cause of death for seniors in both China and the U.S. Automatic fall detection for seniors is therefore required in smart home and smart healthcare systems. Currently, owing to its convenience and low cost, the video-based method is the optimal choice for indoor fall detection compared with other methods such as wearable sensors and ambient sensors. In this paper, we propose a novel 2D video-based fall detection pipeline with human pose estimation. First, we used OpenPose to extract the positions of human joints from the raw data. Second, these data, together with augmented features, became the input of a convolutional neural network so that multi-layered features could be extracted. Third, a binary classification was conducted through the neural network. For comparison, we also used an SVM as the classifier. Finally, we achieved relatively high sensitivity and specificity when comparing our results with other state-of-the-art approaches on three public fall datasets.
… in this area include body pose estimation, hand pose estimation and head pose estimation. … Li, “Inference of human postures by classification of 3d human body shape,” in AMFG, 2003. …
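This report organizes research on human action recognition into four core dimensions: the YOLOv8-based pose estimation framework, spatio-temporal attention mechanisms and feature modeling, few-shot learning and generalization strategies, and lightweight models and practical application systems. The taxonomy covers the full pipeline from keypoint extraction and dynamic feature enhancement to few-shot training and engineering deployment, and it provides theoretical and technical support for human behavior detection under few-shot conditions.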