Abstract
In this paper, we introduce a method to automatically reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces exerted on the human body. The main contributions of this work are three-fold. First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions. This is cast as a large-scale trajectory optimization problem. Second, we develop a method to automatically recognize from the input video the 2D position and timing of contacts between the person and the object or the ground, thereby significantly simplifying the complexity of the optimization. Third, we validate our approach on a recent video + MoCap dataset capturing typical parkour actions, and demonstrate its performance on a new dataset of Internet videos showing people manipulating a variety of tools in unconstrained environments.
Notes
In this paper, trajectories are denoted as underlined variables, e.g. \({\underline{x}},{\underline{u}}~\text {or}~{\underline{c}}\).
Spatial velocities (accelerations) are minimal and unified representations of linear and angular velocities (accelerations) of a rigid body (Featherstone 2008). They are of dimension 6.
References
Abdulla, W. (2017). Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN
Agarwal, S., Mierle, K., et al. (2012). Ceres solver. http://ceres-solver.org
Akhter, I., & Black, M. J. (2015). Pose-conditioned joint angle limits for 3d human pose reconstruction. In CVPR
Alayrac, J. B., Bojanowski, P., Agrawal, N., Laptev, I., Sivic, J., & Lacoste-Julien, S. (2016). Unsupervised learning from narrated instruction videos. In CVPR.
Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2d human pose estimation: New benchmark and state of the art analysis. In CVPR.
Biegler, L. T. (2010). Nonlinear programming: Concepts, algorithms, and applications to chemical processes (Vol. 10, Chap 10). SIAM.
Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., & Black, M. J. (2016). Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image. In ECCV.
Boulic, R., Thalmann, N. M., & Thalmann, D. (1990). A global human walking model with real-time kinematic personification. The Visual Computer, 6(6), 344–358. https://doi.org/10.1007/BF01901021
Bourdev, L., & Malik, J. (2011). The human annotation tool. https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/hat/
Brachmann, E., Michel, F., Krull, A., Ying Yang, M., Gumhold, S., et al. (2016). Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In CVPR.
Brubaker, M. A., Fleet, D. J., & Hertzmann, A. (2007). Physics-based person tracking using simplified lower-body dynamics. In CVPR.
Brubaker, M. A., Sigal, L., & Fleet, D. J. (2009). Estimating contact dynamics. In CVPR.
Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.
Carpentier, J., & Mansard, N. (2018a). Analytical derivatives of rigid body dynamics algorithms. In Robotics: Science and Systems.
Carpentier, J., & Mansard, N. (2018b). Multi-contact locomotion of legged robots. IEEE Transactions on Robotics.
Carpentier, J., Valenza, F., Mansard, N., et al. (2015–2019). Pinocchio: Fast forward and inverse dynamics for poly-articulated systems. https://stack-of-tasks.github.io/pinocchio
Carpentier, J., Del Prete, A., Tonneau, S., Flayols, T., Forget, F., Mifsud, A., et al. (2017). Multi-contact locomotion of legged robots in complex environments-the loco3d project. In RSS workshop on challenges in dynamic legged locomotion (p. 3p).
Carpentier, J., Saurel, G., Buondonno, G., Mirabel, J., Lamiraux, F., Stasse, O., & Mansard, N. (2019). The pinocchio c++ library—A fast and flexible implementation of rigid body dynamics algorithms and their analytical derivatives. In IEEE international symposium on system integrations (SII).
Chen, C. H., & Ramanan, D. (2017). 3d human pose estimation = 2d pose estimation + matching. In CVPR.
Delaitre, V., Sivic, J., & Laptev, I. (2011). Learning person–object interactions for action recognition in still images. In NIPS.
Diehl, M., Bock, H., Diedam, H., & Wieber, P. B. (2006). Fast direct multiple shooting algorithms for optimal robot control. In Fast motions in biomechanics and robotics. Springer.
Doumanoglou, A., Kouskouridas, R., Malassiotis, S., & Kim, T. K. (2016). 6D object detection and next-best-view prediction in the crowd. In CVPR.
Featherstone, R. (2008). Rigid body dynamics algorithms. Berlin: Springer.
Fouhey, D. F., Delaitre, V., Gupta, A., Efros, A. A., Laptev, I., & Sivic, J. (2014). People watching: Human actions as a cue for single view geometry. IJCV, 110(3), 259–274.
Gall, J., Rosenhahn, B., Brox, T., & Seidel, H. P. (2010). Optimization and filtering for human motion capture. International Journal of Computer Vision, 87(1–2), 75.
Gammeter, S., Ess, A., Jäggli, T., Schindler, K., Leibe, B., & Van Gool, L. (2008). Articulated multi-body tracking under egomotion. In ECCV.
Gower, J. C. (1975). Generalized procrustes analysis. Psychometrika, 40(1), 33–51.
Grabner, A., Roth, P. M., & Lepetit, V. (2018). 3D pose estimation and 3D model retrieval for objects in the wild. In CVPR.
Gupta, A., Kembhavi, A., & Davis, L. S. (2009). Observing human–object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 1775–1789.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.
He, K., Gkioxari, G., Dollár, P., & Girshick, R. B. (2017). Mask R-CNN. CoRR. arXiv:1703.06870
Herdt, A., Perrin, N., & Wieber, P. B. (2010). Walking without thinking about it. In International Conference on Intelligent Robots and Systems (IROS). https://doi.org/10.1109/IROS.2010.5654429
Hinterstoisser, S., Lepetit, V., Rajkumar, N., & Konolige, K. (2016). Going further with point pair features. In ECCV.
Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., & Schiele, B. (2016). Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV.
Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2014). Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339.
Jiang, Y., Koppula, H., & Saxena, A. (2013). Hallucinated humans as the hidden context for labeling 3d scenes. In CVPR.
Kanazawa, A., Black, M. J., Jacobs, D. W., & Malik, J. (2018). End-to-end recovery of human shape and pose. In CVPR.
Kanazawa, A., Zhang, J. Y., Felsen, P., & Malik, J. (2019). Learning 3d human dynamics from video. In CVPR (pp. 5614–5623).
Kocabas, M., Athanasiou, N., & Black, M. J. (2020). Vibe: Video inference for human body pose and shape estimation. In CVPR (pp. 5253–5263).
Kuffner, J., Nishiwaki, K., Kagami, S., Inaba, M., & Inoue, H. (2005). Motion planning for humanoid robots. In Robotics research. The eleventh international symposium.
Li, Y., Wang, G., Ji, X., Xiang, Y., & Fox, D. (2018). DeepIM: Deep iterative matching for 6D pose estimation. In ECCV.
Li, Z., Sedlar, J., Carpentier, J., Laptev, I., Mansard, N., & Sivic, J. (2019). Estimating 3d motion and forces of person–object interactions from monocular video. In Computer vision and pattern recognition (CVPR).
Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., et al. (2014). Microsoft COCO: Common objects in context. CoRR. arXiv:1405.0312
Loing, V., Marlet, R., & Aubry, M. (2018). Virtual training for a real application: Accurate object-robot relative localization without calibration. In IJCV. https://doi.org/10.1007/s11263-018-1102-6
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6), 248.
Loper, M. M., Mahmood, N., & Black, M. J. (2014). MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (Proc SIGGRAPH Asia), 33(6), 220:1-220:13. https://doi.org/10.1145/2661229.2661273
Maldonado, G. (2018). Some biomechanical and robotic models. https://github.com/GaloMALDONADO/Models
Maldonado, G., Bailly, F., Souères, P., & Watier, B. (2017). Angular momentum regulation strategies for highly dynamic landing in Parkour. Computer Methods in Biomechanics and Biomedical Engineering, 20(sup1), 123–124. https://doi.org/10.1080/10255842.2017.1382892, https://hal.archives-ouvertes.fr/hal-01636353
Malmaud, J., Huang, J., Rathod, V., Johnston, N., Rabinovich, A., & Murphy, K. (2015). What’s cookin’? interpreting cooking videos using text, speech and vision. arXiv preprint arXiv:1503.01558
Marinoiu, E., Papava, D., & Sminchisescu, C. (2013). Pictorial human spaces: How well do humans perceive a 3d articulated pose? In ICCV (pp. 1289–1296).
Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3d human pose estimation. In ICCV.
Mordatch, I., Todorov, E., & Popović, Z. (2012). Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics (TOG), 31(4), 43.
Moreno-Noguer, F. (2017). 3d human pose estimation from a single image via distance matrix regression. In CVPR.
Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In ECCV.
Newell, A., Huang, Z., & Deng, J. (2017). Associative embedding: End-to-end learning for joint detection and grouping. In NIPS.
Oberweger, M., Rad, M., & Lepetit, V. (2018). Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In ECCV.
Pavlakos, G., Zhou, X., Derpanis, K. G., & Daniilidis, K. (2017). Coarse-to-fine volumetric prediction for single-image 3d human pose. In CVPR.
Posa, M., Cantu, C., & Tedrake, R. (2014). A direct method for trajectory optimization of rigid bodies through contact. The International Journal of Robotics Research, 33(1), 69–81.
Prest, A., Ferrari, V., & Schmid, C. (2013). Explicit modeling of human–object interactions in realistic videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 835–848.
Project webpage. (2021). https://www.di.ens.fr/willow/research/motionforcesfromvideo/
Rad, M., & Lepetit, V. (2017). Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In ICCV.
Rad, M., Oberweger, M., & Lepetit, V. (2018). Feature mapping for learning fast and accurate 3D pose inference from synthetic images. In CVPR.
Rempe, D., Guibas, L. J., Hertzmann, A., Russell, B., Villegas, R., & Yang, J. (2020). Contact and human dynamics from monocular video. In ECCV (pp. 71–87).
Schultz, G., & Mombaur, K. (2010). Modeling and optimal control of human-like running. IEEE/ASME Transactions on Mechatronics, 15(5), 783–792.
Shimada, S., Golyanik, V., Xu, W., & Theobalt, C. (2020). Physcap: Physically plausible monocular 3d motion capture in real time. ACM Transactions on Graphics (TOG), 39(6), 1–16.
Sidenbladh, H., Black, M. J., Fleet, D. J. (2000). Stochastic tracking of 3d human figures using 2d image motion. In ECCV.
Tassa, Y., Erez, T., & Todorov, E. (2012). Synthesis and stabilization of complex behaviors through online trajectory optimization. In IEEE international conference on intelligent robots and systems (IROS). https://doi.org/10.1109/IROS.2012.6386025
Taylor, C. J. (2000). Reconstruction of articulated objects from point correspondences in a single uncalibrated image. Computer Vision and Image Understanding, 80(3), 349–363.
Tejani, A., Tang, D., Kouskouridas, R., & Kim, T. K. (2014). Latent-class hough forests for 3d object detection and pose estimation. In ECCV.
Tekin, B., Rozantsev, A., Lepetit, V., & Fua, P. (2016). Direct prediction of 3d body poses from motion compensated sequences. In CVPR.
Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. CoRR. arXiv:1703.06907
Tonneau, S., Del Prete, A., Pettré, J., Park, C., Manocha, D., & Mansard, N. (2018a). An efficient acyclic contact planner for multiped robots. IEEE Transactions on Robotics (TRO). https://doi.org/10.1109/TRO.2018.2819658
Tonneau, S., Del Prete, A., Pettré, J., Park, C., Manocha, D., & Mansard, N. (2018b). An efficient acyclic contact planner for multiped robots. IEEE Transactions on Robotics, 34(3), 586–601.
Triggs, B., McLauchlan, P. F., Hartley, R. I., Fitzgibbon, A. W. (1999). Bundle adjustment—A modern synthesis. In International workshop on vision algorithms.
Wei, X., & Chai, J. (2010). Videomocap: Modeling physically realistic human motion from monocular video sequences. ACM Transactions on Graphics, 29(4), 42:1-42:10. https://doi.org/10.1145/1778765.1778779
Westervelt, E. R., Grizzle, J. W., & Koditschek, D. E. (2003). Hybrid zero dynamics of planar biped walkers. IEEE Transactions on Automatic Control, 48(1), 42–56. https://doi.org/10.1109/TAC.2002.806653
Winkler, A. W., Bellicoso, C. D., Hutter, M., & Buchli, J. (2018). Gait and trajectory optimization for legged systems through phase-based end-effector parameterization. IEEE Robotics and Automation Letters, 3(3), 1560–1567.
Xiang, D., Joo, H., & Sheikh, Y. (2019). Monocular total capture: Posing face, body, and hands in the wild. In CVPR (pp. 10965–10974).
Xiang, Y., Schmidt, T., Narayanan, V., & Fox, D. (2017). Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. CoRR. arXiv:1711.00199
Yao, B., & Fei-Fei, L. (2012). Recognizing human–object interactions in still images by modeling the mutual context of objects and human poses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1691–1703.
Zanfir, A., Marinoiu, E., & Sminchisescu, C. (2018). Monocular 3d pose and shape estimation of multiple people in natural scenes—The importance of multiple scene constraints. In CVPR (pp. 2148–2157).
Zhou, X., Zhu, M., Leonardos, S., Derpanis, K. G., & Daniilidis, K. (2016). Sparseness meets deepness: 3d human pose estimation from monocular video. In CVPR.
Acknowledgements
We thank Bruno Watier (Université Paul Sabatier and LAAS-CNRS) and Galo Maldonado (ENSAM ParisTech) for making public the Parkour dataset. This work was partly supported by the ERC grant LEAP (No. 336845), the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, references ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and ANR-19-P3IA-0004 (ANITI 3IA Institute), and the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468).
Additional information
Communicated by Wenjun Zeng.
Appendices
Outline of the Appendix
In this appendix, we provide additional technical details of the proposed method. Appendix A gives a comprehensive description of the parametric human and object models used in the trajectory optimization. Appendix B then details the ground contact force generators mentioned in the main paper (Sect. 4.3).
Parametric Human and Object Models
Human model We model the human body as a multi-body system consisting of a set of rotating joints and rigid links connecting them. We adopt the joint definition of the SMPL model (Loper et al. 2015) and approximate the human skeleton as a kinematic tree with 24 joints: one free-floating joint and 23 spherical joints. Figure 12 illustrates our human model in a canonical pose. A free-floating joint consists of a 3-dof translation in \({\mathbb {R}}^3\) and a 3-dof rotation in SO(3); we model the pelvis by a free-floating joint to describe the person’s body orientation and translation in the world coordinate frame. A spherical joint is a 3-dof rotation; it represents the relative rotation between two connected links in our model. In practice, we use unit quaternions to represent 3D rotations and axis-angles to describe angular velocities. As a result, the configuration vector of our human model \(q^\mathrm {h}\) is a concatenation of the configuration vectors of the 23 spherical joints (dimension 4) and the free-floating pelvis joint (dimension 7), hence of dimension 99. The corresponding human joint velocity \({\dot{q}}^\mathrm {h}\) is of dimension \(23\times 3+6=75\) (by replacing the quaternions with axis-angles). For simplicity, in the main paper we do not distinguish this difference in dimension and consider both \(q^\mathrm {h}\) and \({\dot{q}}^\mathrm {h}\) to be represented using axis-angles, hence of the same dimension \(n_q^\mathrm {h}=75\). In addition, based on these 24 joints, we define 18 “virtual markers” (shown as colored spheres in Fig. 12) that represent the 18 OpenPose joints. These markers are used instead of the 24 joints to compute the re-projection errors with respect to the OpenPose 2D detections.
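The dimension bookkeeping above (99-dimensional configuration vs. 75-dimensional velocity) can be illustrated with a short sketch. The helper below is ours, not part of the released implementation, and assumes the (x, y, z, w) quaternion ordering:

```python
import numpy as np

# Dimension bookkeeping for the human model:
# 1 free-floating pelvis joint + 23 spherical joints.
N_SPHERICAL = 23

# Configuration: pelvis = 3D translation + unit quaternion (7 numbers),
# each spherical joint = unit quaternion (4 numbers).
nq = 7 + 4 * N_SPHERICAL   # = 99

# Velocity: pelvis = 6D spatial velocity, each spherical joint = 3D axis-angle rate.
nv = 6 + 3 * N_SPHERICAL   # = 75

def quat_to_axis_angle(q):
    """Convert a unit quaternion (x, y, z, w) to an axis-angle vector in R^3."""
    q = np.asarray(q, dtype=float)
    v, w = q[:3], q[3]
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:           # identity rotation
        return np.zeros(3)
    angle = 2.0 * np.arctan2(norm_v, w)
    return angle * v / norm_v
```

Replacing each 4D quaternion with its 3D axis-angle image is what brings the 99-dimensional \(q^\mathrm {h}\) down to the 75-dimensional \({\dot{q}}^\mathrm {h}\).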
Object models All four objects, namely barbell, hammer, scythe and spade, are modeled as a rigid stick (a line segment). The configuration \(q^\mathrm {o}\) represents the 6-dof displacement of the stick handle, as illustrated in Fig. 13. In practice, \(q^\mathrm {o}\) is a 7-dimensional vector containing the 3D translation and the 4D quaternion rotation of the free-floating handle end. The object joint velocity \({\dot{q}}^\mathrm {o}\) is of dimension 6 (obtained by replacing the quaternion with an axis-angle). For all of these handtools, the stick handle serves as the contact area: we ignore the handle’s thickness and represent the contact area by the line segment between the two endpoints of the handle. Depending on the number of human joints in contact with the object, we attach the same number of contact points to the object’s local coordinate frame. These contact points can be located anywhere along the feasible contact area. In practice, all object contact points, together with the endpoint corresponding to the head of the handtool, are implemented as “virtual” prismatic joints of dimension 1.
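As an illustration of the 1-dof “virtual” prismatic joints, a contact point constrained to the handle segment can be sketched as follows (the normalized coordinate t and the function name are our assumptions, not the paper's notation):

```python
import numpy as np

def contact_point_on_handle(handle_origin, handle_end, t):
    """Position of a contact point at normalized prismatic coordinate t in [0, 1],
    measured along the handle line segment (handle thickness ignored)."""
    handle_origin = np.asarray(handle_origin, dtype=float)
    handle_end = np.asarray(handle_end, dtype=float)
    t = np.clip(t, 0.0, 1.0)  # keep the point within the feasible contact area
    return handle_origin + t * (handle_end - handle_origin)
```

The single scalar t is exactly the degree of freedom the prismatic joint exposes, which lets the optimizer slide each contact along the handle.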
Generators of the Ground Contact Forces
In this section, we describe the generators \(g^{(3)}_n\) and \(g^{(6)}_{kn}\) for computing the contact forces exerted by the ground on the person. Recall from the main paper that we consider different contact models depending on the type of the joint. We model the planar contacts between the human sole and the ground by fitting the point contact model [given by Eq. (9) in the main paper] at each of the four sole vertices. For other types of ground contacts, e.g. the knee-ground contact, we apply the point contact model directly at the human joint. We model the ground as a 2D plane \(G = \{p\in {\mathbb {R}}^3|a^Tp=b\}\) with a normal vector \(a\in {\mathbb {R}}^3\), \(a\ne 0\), \(b\in {\mathbb {R}}\) and a friction coefficient \(\mu \). In the following, we first provide the expression of the 3D generators \(g^{(3)}_n\) for modeling point contact forces and then derive the 6D generators \(g^{(6)}_{kn}\) for modeling planar contact forces.
3D generators \(g^{(3)}_n\) for point contact forces Let \(p_k\) be the position of a contact point k located on the ground surface, i.e. \(a^Tp_k=b\). We define at contact point k a right-hand coordinate frame C whose xz-plane overlaps the plane G and whose y-axis points in the gravity direction, i.e., opposite to the ground normal a. During point contact, it is a common assumption that the ground exerts only linear reaction forces at the contact point k. In other words, the spatial contact force expressed in the local frame C can be written as
$$\phi = \begin{pmatrix} f \\ 0_3 \end{pmatrix},$$
where the linear component \(f\) must lie in the second-order cone \({\mathcal {K}}^3 = \{f=(f_x,f_y,f_z)^T|\sqrt{f_x^2 + f_z^2} \le -f_y \tan \mu \}\), with \(\mu \) the friction coefficient. This cone can be approximated by the pyramid \({{\mathcal {K}}^3}^\prime = \{f=\sum _{n=1}^4{\lambda _n g^{(3)}_n}|\lambda _n\ge 0\}\), spanned by the set of 3D generators
$$g^{(3)}_1 = \begin{pmatrix} \tan \mu \\ -1 \\ 0 \end{pmatrix},\quad g^{(3)}_2 = \begin{pmatrix} -\tan \mu \\ -1 \\ 0 \end{pmatrix},\quad g^{(3)}_3 = \begin{pmatrix} 0 \\ -1 \\ \tan \mu \end{pmatrix},\quad g^{(3)}_4 = \begin{pmatrix} 0 \\ -1 \\ -\tan \mu \end{pmatrix}.$$
More formally, we approximate the friction cone \({\mathcal {K}}^3\) with the conic hull \({{\mathcal {K}}^3}^\prime \) spanned by these 4 points on the boundary of \({\mathcal {K}}^3\), namely, \(g^{(3)}_n\) with \(n=1,2,3,4\).
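As a numerical sanity check, the friction-pyramid construction can be sketched as follows. This is a hedged sketch under our own naming; note the y-axis points along gravity, so the normal force component is \(-f_y \ge 0\), and we mirror the paper in applying tan() to the friction coefficient:

```python
import numpy as np

def generators_3d(mu):
    """The 4 inscribed friction-pyramid generators on the boundary of K^3."""
    t = np.tan(mu)
    return [np.array([ t, -1.0, 0.0]),
            np.array([-t, -1.0, 0.0]),
            np.array([0.0, -1.0,  t]),
            np.array([0.0, -1.0, -t])]

def in_friction_cone(f, mu):
    """Check f lies in K^3: sqrt(fx^2 + fz^2) <= -fy * tan(mu)."""
    return np.hypot(f[0], f[2]) <= -f[1] * np.tan(mu) + 1e-9
```

Since the cone is convex and each generator lies on its boundary, any nonnegative combination \(\sum_n \lambda_n g^{(3)}_n\) stays inside the cone, which is what makes the pyramid a conservative inner approximation.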
6D generators \(g^{(6)}_{kn}\) for planar (sole) contact forces Here we show how to obtain the 6D generators \(g^{(6)}_{kn}\) from \(g^{(3)}_{n}\) and the contact point positions \(p_k\). As described in the main paper, we approximate the human sole by a rectangular area with 4 contact points. We assume that the sole overlaps the ground plane G during contact. As in the point contact case, we define 5 parallel coordinate frames: one frame \(C_k\) at each of the four sole contact points, plus a frame A at the ankle joint. Note that the frames \(C_k\) and A are parallel to each other, i.e., there is no rotation but only translation when passing from one frame to another. We can write the contact force at contact point k, expressed in frame \(C_k\), as the 6D spatial force
$$\phi _k = \begin{pmatrix} f_k \\ 0_3 \end{pmatrix} = \sum _{n=1}^4 \lambda _{kn} \begin{pmatrix} g^{(3)}_n \\ 0_3 \end{pmatrix}, \qquad \lambda _{kn} \ge 0.$$
We denote by \(^Ap_k\) the position of contact point \(c_k\) in the ankle frame A, and by \(^AX_{C_k}^*\) the matrix converting spatial forces from frame \(C_k\) to frame A. We can then express the contact force in frame A:
$$^A\phi _k = {}^AX_{C_k}^* \, \phi _k = \sum _{n=1}^4 \lambda _{kn} \, g^{(6)}_{kn},$$
where
$$g^{(6)}_{kn} = {}^AX_{C_k}^* \begin{pmatrix} g^{(3)}_n \\ 0_3 \end{pmatrix} = \begin{pmatrix} g^{(3)}_n \\ {}^Ap_k \times g^{(3)}_n \end{pmatrix}.$$
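Because the frames are parallel, lifting a 3D generator to a 6D spatial-force generator reduces to a cross product. A minimal numerical sketch, assuming the linear-then-angular ordering of spatial forces (the convention used, e.g., by Pinocchio):

```python
import numpy as np

def generator_6d(g3, p_k):
    """6D spatial-force generator at the ankle frame A for a unit contact
    force g3 applied at contact point p_k (coordinates given in frame A).
    Linear part is unchanged; angular part is the moment p_k x g3."""
    g3 = np.asarray(g3, dtype=float)
    p_k = np.asarray(p_k, dtype=float)
    return np.concatenate([g3, np.cross(p_k, g3)])
```

For instance, a downward unit force applied 1 m in front of the ankle produces a unit moment about the ankle's z-axis, matching the cross-product term above.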
Cite this article
Li, Z., Sedlar, J., Carpentier, J. et al. Estimating 3D Motion and Forces of Human–Object Interactions from Internet Videos. Int J Comput Vis 130, 363–383 (2022). https://doi.org/10.1007/s11263-021-01540-1