Open Access Article

Advances in International Computer Science. 2025; 5(1): 1-9. DOI: 10.12208/j.aics.20250001.

Overview on human pose estimation based on deep learning

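Authors: Jiang Yun (江云), Liu Shumin (刘述民) *

School of Software Engineering, Jiangxi University of Science and Technology, Nanchang, Jiangxi

* Corresponding author: Liu Shumin, School of Software Engineering, Jiangxi University of Science and Technology, Nanchang, Jiangxi.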

Published: 2025-04-18

Abstract

Human pose estimation, serving as the foundation for action recognition and detection, is a challenging task in the field of machine vision. With the rapid development of deep learning in recent years, deep learning-based human pose estimation algorithms have achieved remarkable performance and have become a focal point of academic research. This paper first classifies deep learning-based human pose estimation into two categories: single-person pose estimation and multi-person pose estimation. It then reviews the development of these two categories in recent years, providing a comparative analysis of the characteristics of various algorithms. Additionally, the paper introduces commonly used datasets and evaluation metrics for pose estimation. Finally, it discusses the challenges faced by current deep learning-based human pose estimation systems and offers an outlook on future research directions.

Keywords: machine vision; deep learning; human pose estimation; keypoint detection

Cite this article

Jiang Yun, Liu Shumin. Overview on human pose estimation based on deep learning[J]. Advances in International Computer Science, 2025; 5(1): 1-9.