Human Pose Estimation is a computer vision technique used to localize anatomical keypoints in images or video. These points represent joints and body landmarks, enabling the construction of a skeletal model based on 2D or 3D coordinates.
Detection Architectures
Top-Down Approach
In this approach, individuals are first detected using bounding boxes, and subsequently, the pose of each person is estimated independently. This method offers high per-subject precision, but its computational cost increases proportionally with the number of people in the scene.
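The detect-then-estimate loop can be sketched as follows. Note that `detect_people` and `estimate_pose` are hypothetical stand-ins for a real person detector and a single-person pose network; only the control flow reflects the top-down approach:

```python
import numpy as np

def detect_people(image):
    """Hypothetical person detector: returns bounding boxes (x1, y1, x2, y2).
    A real system would run an object-detection network here."""
    return [(10, 20, 110, 220), (150, 30, 250, 230)]

def estimate_pose(crop):
    """Hypothetical single-person pose estimator: returns 17 (x, y) keypoints
    (COCO-style joint count) in crop-local coordinates."""
    h, w = crop.shape[:2]
    rng = np.random.default_rng(0)
    return rng.uniform([0, 0], [w, h], size=(17, 2))

def top_down_pose(image):
    """Run the detector once, then one pose-estimation pass per person —
    which is why cost grows with the number of people in the scene."""
    poses = []
    for (x1, y1, x2, y2) in detect_people(image):
        crop = image[y1:y2, x1:x2]
        kpts = estimate_pose(crop)
        kpts += (x1, y1)  # map crop-local keypoints back to image coordinates
        poses.append(kpts)
    return poses

image = np.zeros((480, 640, 3), dtype=np.uint8)
poses = top_down_pose(image)
print(len(poses), poses[0].shape)  # one (17, 2) pose array per detected person
```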
Bottom-Up Approach
The model detects all keypoints simultaneously and then groups them by individual using association algorithms, such as Part Affinity Fields (PAFs). This approach maintains stable latency in multi-person scenes, though it requires robust assignment mechanisms to link joints correctly.
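The grouping step can be illustrated with a greedy assignment over an affinity matrix. The scores below are made-up values standing in for real PAF line integrals, and the coordinates are illustrative only:

```python
import numpy as np

# Candidate joints detected anywhere in the image (illustrative coordinates).
shoulders = np.array([[100, 50], [300, 60]])
elbows    = np.array([[310, 140], [95, 145]])

# Affinity scores standing in for PAF line integrals: entry [i, j] measures
# how well shoulder i connects to elbow j along the limb direction.
affinity = np.array([[0.1, 0.9],
                     [0.8, 0.2]])

def greedy_match(scores, threshold=0.3):
    """Greedily pair rows with columns by descending affinity score,
    never reusing a joint once it has been assigned."""
    pairs, used_rows, used_cols = [], set(), set()
    flat_order = np.argsort(scores, axis=None)[::-1]
    for i, j in zip(*np.unravel_index(flat_order, scores.shape)):
        if scores[i, j] < threshold:
            break  # remaining candidates are too weak to be real limbs
        if i not in used_rows and j not in used_cols:
            pairs.append((int(i), int(j)))
            used_rows.add(i)
            used_cols.add(j)
    return pairs

print(greedy_match(affinity))  # [(0, 1), (1, 0)]
```

Real systems solve this assignment per limb type and then stitch the limbs into full skeletons; the greedy scheme above captures the core idea in miniature.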
Prediction Methods
Heatmaps
The network generates probability distributions for each joint. Heatmaps provide superior spatial precision and robustness against occlusions, albeit at the cost of higher memory and computational consumption.
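A minimal sketch of the heatmap representation: a Gaussian map is rendered around the joint for training, and the coordinate is recovered at inference as the argmax of the predicted map (map size and sigma here are arbitrary choices):

```python
import numpy as np

def render_heatmap(center, shape=(64, 48), sigma=2.0):
    """Render a 2D Gaussian probability map peaked at the joint location."""
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def decode_heatmap(hm):
    """Recover the joint coordinate as the argmax of the map."""
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    return int(x), int(y)

hm = render_heatmap((20, 30))
print(decode_heatmap(hm))  # (20, 30)
```

Storing one such map per joint is what drives the higher memory cost relative to direct regression.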
Direct Regression
The model directly predicts the coordinates (x, y) or (x, y, z). This significantly reduces latency and model size, making it ideal for edge computing, although it tends to be less robust in complex or highly cluttered scenarios.
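A toy regression head, using random placeholder weights and an assumed 17-joint layout, showing how coordinates are predicted directly from a pooled feature vector with no intermediate heatmaps:

```python
import numpy as np

NUM_JOINTS = 17  # COCO-style joint count (assumption for illustration)

def regression_head(features, weights, bias):
    """Single linear layer mapping a pooled feature vector directly to
    (x, y) coordinates for every joint — no heatmaps involved."""
    coords = features @ weights + bias  # shape: (NUM_JOINTS * 2,)
    return coords.reshape(NUM_JOINTS, 2)

rng = np.random.default_rng(0)
features = rng.standard_normal(256)  # pooled backbone features
weights  = rng.standard_normal((256, NUM_JOINTS * 2)) * 0.01
bias     = np.zeros(NUM_JOINTS * 2)

coords = regression_head(features, weights, bias)
print(coords.shape)  # (17, 2)
```

Because the output is just 2 × NUM_JOINTS scalars rather than a full-resolution map per joint, both latency and model size drop, which is the trade-off the text describes.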
Models
- MediaPipe: Optimized for real-time inference on mobile devices and web browsers.
- OpenPose: The academic benchmark for multi-person bottom-up pose estimation.
- YOLOv8: Integrates object detection and pose estimation into a high-speed, single-shot model.
- DeepLabCut: Specifically designed for biomedical research and animal behavioral analysis.
3D Estimation
3D estimation incorporates the depth coordinate. Through 3D lifting techniques, deep neural networks infer the $Z$ component from 2D poses by learning kinematic constraints. This allows for full spatial reconstruction without the need for specialized infrared sensors.
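A toy lifter in the spirit of the fully connected 2D-to-3D baseline of Martinez et al. (2017); the weights are random placeholders rather than trained values, and the 17-joint layout is an assumption:

```python
import numpy as np

NUM_JOINTS = 17  # joint count assumed for illustration

def lift_2d_to_3d(pose_2d, w1, w2):
    """Flatten the 2D pose, pass it through one hidden layer, and regress
    a 3D pose. A trained model learns kinematic constraints in these weights."""
    h = np.maximum(0, pose_2d.reshape(-1) @ w1)  # ReLU hidden layer
    return (h @ w2).reshape(NUM_JOINTS, 3)

rng = np.random.default_rng(0)
pose_2d = rng.uniform(0, 1, size=(NUM_JOINTS, 2))  # normalized 2D keypoints
w1 = rng.standard_normal((NUM_JOINTS * 2, 64)) * 0.1
w2 = rng.standard_normal((64, NUM_JOINTS * 3)) * 0.1

pose_3d = lift_2d_to_3d(pose_2d, w1, w2)
print(pose_3d.shape)  # (17, 3)
```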
Applications
In the healthcare sector, this technology is essential for monitoring rehabilitation through range-of-motion analysis and automatic fall detection for the elderly. In elite sports, it enables precise kinematic analysis to optimize performance and prevent injuries. Likewise, in the industrial sector, it facilitates automated ergonomic workplace assessments. Pose estimation goes beyond simple visual inspection by converting video sequences into structured biomechanical data, ready for integration into advanced analytics ecosystems.
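Range-of-motion analysis reduces to computing joint angles from estimated keypoints. A minimal sketch, with made-up coordinates standing in for a pose estimator's output:

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at vertex b (in degrees) formed by keypoints a-b-c,
    e.g. hip-knee-ankle for knee flexion in a rehabilitation session."""
    v1 = np.asarray(a, float) - np.asarray(b, float)
    v2 = np.asarray(c, float) - np.asarray(b, float)
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

# Hip, knee, and ankle keypoints (hypothetical values):
print(joint_angle((0, 0), (0, 1), (1, 1)))  # 90.0
```

Tracking this angle frame by frame over a session yields the range-of-motion curves used in rehabilitation monitoring.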
Technical References
- OpenPose (Bottom-Up Architecture): Cao, Z., et al. (2019). “OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”. arXiv:1812.08008
- MediaPipe (Real-time Inference): Bazarevsky, V., et al. (2020). “BlazePose: On-device Real-time Body Pose tracking”. arXiv:2006.10204
- 3D Lifting (2D to 3D): Martinez, J., et al. (2017). “A simple yet effective baseline for 3d human pose estimation”. arXiv:1705.03098
- DeepLabCut (Animal Behavioral Research): Mathis, A., et al. (2018). “DeepLabCut: markerless pose estimation of user-defined body parts with deep learning”. Nature Neuroscience.
- YOLOv8-Pose (Single-shot Detection): Ultralytics (2023). “YOLOv8 Pose Estimation Documentation”. docs.ultralytics.com
- HRNet (High-Res Heatmaps): Sun, K., et al. (2019). “Deep High-Resolution Representation Learning for Human Pose Estimation”. arXiv:1902.09212
- Rodríguez Beceiro, P. (2023). “AI pose detection applied to biomechanical analysis”. UPM Digital Archive.