Human motion capture traditionally relies on marker-based optical systems (Vicon, OptiTrack) or dense sensor suits, both of which require controlled environments and expensive infrastructure. Sparse inertial motion capture has emerged as a compelling alternative, using only six inertial measurement units (IMUs) placed at key body locations to reconstruct full-body pose in unconstrained environments (Yi, Zhou, and Xu 2021). Unlike vision-based methods, IMU-based approaches are immune to occlusion and work in any lighting condition, though they face challenges from sensor drift, noise, and the inherent ambiguity of mapping sparse measurements to full-body pose.
Early learning-based methods applied recurrent neural networks to model temporal dependencies in IMU sequences, but struggled with physical plausibility, producing motions with floating, sliding, or ground-penetration artifacts. Physical Inertial Poser (PIP) (Yi et al. 2022) introduced physics-aware optimization to enforce ground contact constraints and biomechanical feasibility, combining neural kinematics estimation with physics-based refinement. Physical Non-inertial Poser (PNP) (Yi, Zhou, and Xu 2024) extended this by modeling fictitious forces in non-inertial reference frames, addressing errors that arise when the pelvis (root) undergoes significant acceleration or rotation.
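To make the role of fictitious forces concrete, the following is a minimal sketch of the textbook rigid-body relation between a point's acceleration in an inertial frame and its acceleration as seen from an accelerating, rotating root frame. It is not the formulation used in PNP; the function name `relative_acceleration` and all of its parameters are illustrative assumptions, and all vectors are assumed to be expressed in the same coordinate basis.

```python
# Hypothetical helper names; this is the classical kinematics identity,
# not code from any of the cited systems.

def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def sub(a, b):
    """Component-wise difference a - b."""
    return tuple(x - y for x, y in zip(a, b))

def scale(s, a):
    """Scalar multiple s * a."""
    return tuple(s * x for x in a)

def relative_acceleration(a_inertial, a_root, omega, alpha, r, v_rel):
    """Acceleration of a point as observed from the moving root frame.

    a_inertial: acceleration of the point in the inertial frame
    a_root:     linear acceleration of the root frame origin
    omega:      angular velocity of the root frame
    alpha:      angular acceleration of the root frame
    r:          position of the point relative to the root origin
    v_rel:      velocity of the point relative to the root frame
    """
    euler_term = cross(alpha, r)                      # alpha x r
    centripetal_term = cross(omega, cross(omega, r))  # omega x (omega x r)
    coriolis_term = scale(2.0, cross(omega, v_rel))   # 2 omega x v_rel
    result = a_inertial
    for term in (a_root, euler_term, centripetal_term, coriolis_term):
        result = sub(result, term)
    return result

# A point at rest in a frame spinning about z: its inertial acceleration is
# purely centripetal, so the root frame observes no acceleration.
print(relative_acceleration((-1.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                            (0.0, 0.0, 1.0), (0.0, 0.0, 0.0),
                            (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)))
```

The subtracted terms are exactly what a model that treats the root frame as inertial would leave unaccounted for, which is why errors grow when the pelvis accelerates or rotates quickly.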
Calibration presents a fundamental challenge: sensors must be precisely aligned to body segments, and magnetometer drift causes heading errors over time. Traditional calibration requires the user to hold static poses (T-pose, N-pose), which interrupts the workflow and becomes invalid once sensors shift. Transformer IMU Calibrator (TIC) (Zuo et al. 2025) achieves dynamic, implicit calibration by learning to estimate calibration matrices from diverse motion patterns, enabling seamless "put on and use" operation without explicit calibration procedures.
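For reference, the traditional static-pose calibration mentioned above can be sketched as follows. This is a simplified illustration, not the method of TIC or any other cited system: it assumes the sensor and bone orientations are already expressed in a common global frame (so heading alignment is ignored), and the names `calibrate_offset` and `bone_orientation` are hypothetical.

```python
import math

def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    """Transpose of a 3x3 matrix (the inverse, for a rotation matrix)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rot_z(angle):
    """Rotation by `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

def calibrate_offset(r_sensor_tpose, r_bone_tpose=IDENTITY):
    """Sensor-to-bone offset from a single static pose.

    In the static pose the bone orientation is assumed known, so the fixed
    offset is R_offset = R_sensor^T * R_bone.
    """
    return matmul(transpose(r_sensor_tpose), r_bone_tpose)

def bone_orientation(r_sensor, r_offset):
    """Bone orientation from a later sensor reading: R_bone = R_sensor * R_offset."""
    return matmul(r_sensor, r_offset)

# The sensor is mounted 0.3 rad off the bone axis; after calibration a
# further 0.5 rad rotation of the limb is recovered as the bone orientation.
tpose_reading = rot_z(0.3)
offset = calibrate_offset(tpose_reading)
later_reading = matmul(rot_z(0.5), tpose_reading)
print(bone_orientation(later_reading, offset))
```

Because the offset is estimated once and then held fixed, any subsequent slip of the sensor on the body silently corrupts every later bone orientation, which is the failure mode that dynamic calibration targets.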
Multi-modal fusion enhances accuracy by combining IMUs with complementary sensors. DiffCap (Pan et al. 2025) fuses sparse IMUs with a monocular camera using diffusion models, where visual information provides dense constraints when available and the IMUs ensure robustness during occlusion. BaroPoser (Zhang, Yi, and Xu 2025) incorporates barometric pressure from everyday devices (smartphones, smartwatches) to estimate height changes, enabling motion capture on non-flat terrain. Ultra Inertial Poser (Armani et al. 2024) adds ultra-wideband (UWB) ranging between sensors, providing absolute inter-sensor distances that reduce drift and jitter.
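As an illustration of why barometric pressure carries height information, the sketch below converts a pressure change into a relative height change using the standard isothermal hypsometric relation. It is not taken from BaroPoser; the function name `height_change`, the default temperature, and the example pressures are assumptions chosen for illustration.

```python
import math

R_GAS = 8.314462618  # universal gas constant, J/(mol*K)
M_AIR = 0.0289644    # molar mass of dry air, kg/mol
G = 9.80665          # standard gravity, m/s^2

def height_change(p_ref_pa, p_pa, temperature_k=288.15):
    """Height gained (in metres) when pressure goes from p_ref_pa to p_pa.

    Assumes an isothermal air column at `temperature_k`, which is a
    reasonable approximation only for small height differences.
    """
    scale_height = R_GAS * temperature_k / (M_AIR * G)
    return scale_height * math.log(p_ref_pa / p_pa)

# A pressure drop of 12 Pa near sea level corresponds to roughly one metre.
print(height_change(101325.0, 101313.0))
```

Since a one-metre change corresponds to only about 12 Pa under these assumptions, consumer barometer noise and ambient pressure fluctuations are significant relative to the signal, so such readings are typically most useful as relative height cues combined with other sensors.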
Recent work addresses practical deployment challenges. Loose Inertial Poser (Zuo et al. 2024) enables motion capture from sensors embedded in loose-fitting clothing by modeling secondary motion effects. MagShield (Shao et al. 2025) detects and corrects magnetic disturbances that corrupt orientation estimates in real-world environments. Group Inertial Poser (Xue et al. 2025) extends to multi-person tracking by leveraging inter-person UWB distances for relative positioning.
The field has converged on four ingredients: (1) deep neural networks for learning motion priors from large datasets, (2) physics-based refinement for physical plausibility, (3) multi-modal fusion for enhanced accuracy, and (4) dynamic calibration for practical deployment. Global translation estimation, particularly in the vertical direction, remains challenging, with physics-based contact reasoning (Yi, Pan, and Xu 2025) providing the current best solution.