
Estimating 3D Motion and Forces of Human–Object Interactions from Internet Videos

Abstract

In this paper, we introduce a method to automatically reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces exerted on the human body. The main contributions of this work are three-fold. First, we introduce an approach to jointly estimate the motion and the actuation forces of the person on the manipulated object by modeling contacts and the dynamics of the interactions. This is cast as a large-scale trajectory optimization problem. Second, we develop a method to automatically recognize from the input video the 2D position and timing of contacts between the person and the object or the ground, thereby significantly simplifying the complexity of the optimization. Third, we validate our approach on a recent video + MoCap dataset capturing typical parkour actions, and demonstrate its performance on a new dataset of Internet videos showing people manipulating a variety of tools in unconstrained environments.


Notes

  1. In this paper, trajectories are denoted as underlined variables, e.g. \({\underline{x}},{\underline{u}}~\text {or}~{\underline{c}}\).

  2. Spatial velocities (accelerations) are minimal and unified representations of linear and angular velocities (accelerations) of a rigid body (Featherstone 2008). They are of dimension 6.

References

  • Abdulla, W. (2017). Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow. https://github.com/matterport/Mask_RCNN

  • Agarwal, S., Mierle, K., et al. (2012). Ceres solver. http://ceres-solver.org

  • Akhter, I., & Black, M. J. (2015). Pose-conditioned joint angle limits for 3d human pose reconstruction. In CVPR.

  • Alayrac, J. B., Bojanowski, P., Agrawal, N., Laptev, I., Sivic, J., & Lacoste-Julien, S. (2016). Unsupervised learning from narrated instruction videos. In CVPR.

  • Andriluka, M., Pishchulin, L., Gehler, P., & Schiele, B. (2014). 2d human pose estimation: New benchmark and state of the art analysis. In CVPR.

  • Biegler, L. T. (2010). Nonlinear programming: Concepts, algorithms, and applications to chemical processes (Vol. 10, Chap 10). SIAM.

  • Bogo, F., Kanazawa, A., Lassner, C., Gehler, P., Romero, J., & Black, M. J. (2016). Keep it SMPL: Automatic estimation of 3d human pose and shape from a single image. In ECCV.

  • Boulic, R., Thalmann, N. M., & Thalmann, D. (1990). A global human walking model with real-time kinematic personification. The Visual Computer, 6(6), 344–358. https://doi.org/10.1007/BF01901021

  • Bourdev, L., & Malik, J. (2011). The human annotation tool. https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/shape/hat/

  • Brachmann, E., Michel, F., Krull, A., Ying Yang, M., Gumhold, S., et al. (2016). Uncertainty-driven 6d pose estimation of objects and scenes from a single rgb image. In CVPR.

  • Brubaker, M. A., Fleet, D. J., & Hertzmann, A. (2007). Physics-based person tracking using simplified lower-body dynamics. In CVPR.

  • Brubaker, M. A., Sigal, L., & Fleet, D. J. (2009). Estimating contact dynamics. In CVPR.

  • Cao, Z., Simon, T., Wei, S. E., & Sheikh, Y. (2017). Realtime multi-person 2d pose estimation using part affinity fields. In CVPR.

  • Carpentier, J., & Mansard, N. (2018a). Analytical derivatives of rigid body dynamics algorithms. In Robotics: Science and Systems.

  • Carpentier, J., & Mansard, N. (2018b). Multi-contact locomotion of legged robots. IEEE Transactions on Robotics.

  • Carpentier, J., Valenza, F., Mansard, N., et al. (2015–2019). Pinocchio: Fast forward and inverse dynamics for poly-articulated systems. https://stack-of-tasks.github.io/pinocchio

  • Carpentier, J., Del Prete, A., Tonneau, S., Flayols, T., Forget, F., Mifsud, A., et al. (2017). Multi-contact locomotion of legged robots in complex environments-the loco3d project. In RSS workshop on challenges in dynamic legged locomotion (p. 3p).

  • Carpentier, J., Saurel, G., Buondonno, G., Mirabel, J., Lamiraux, F., Stasse, O., & Mansard, N. (2019). The pinocchio c++ library—A fast and flexible implementation of rigid body dynamics algorithms and their analytical derivatives. In IEEE international symposium on system integrations (SII).

  • Chen, C. H., & Ramanan, D. (2017). 3d human pose estimation = 2d pose estimation + matching. In CVPR.

  • Delaitre, V., Sivic, J., & Laptev, I. (2011). Learning person–object interactions for action recognition in still images. In NIPS.

  • Diehl, M., Bock, H., Diedam, H., & Wieber, P. B. (2006). Fast direct multiple shooting algorithms for optimal robot control. In Fast motions in biomechanics and robotics. Springer.

  • Doumanoglou, A., Kouskouridas, R., Malassiotis, S., & Kim, T. K. (2016). 6D object detection and next-best-view prediction in the crowd. In CVPR.

  • Featherstone, R. (2008). Rigid body dynamics algorithms. Berlin: Springer.

  • Fouhey, D. F., Delaitre, V., Gupta, A., Efros, A. A., Laptev, I., & Sivic, J. (2014). People watching: Human actions as a cue for single view geometry. IJCV, 110(3), 259–274.

  • Gall, J., Rosenhahn, B., Brox, T., & Seidel, H. P. (2010). Optimization and filtering for human motion capture. International Journal of Computer Vision, 87(1–2), 75.

  • Gammeter, S., Ess, A., Jäggli, T., Schindler, K., Leibe, B., & Van Gool, L. (2008). Articulated multi-body tracking under egomotion. In ECCV.

  • Gower, J. C. (1975). Generalized procrustes analysis. Psychometrika, 40(1), 33–51.

  • Grabner, A., Roth, P. M., & Lepetit, V. (2018). 3D pose estimation and 3D model retrieval for objects in the wild. In CVPR.

  • Gupta, A., Kembhavi, A., & Davis, L. S. (2009). Observing human–object interactions: Using spatial and functional compatibility for recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(10), 1775–1789.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • He, K., Gkioxari, G., Dollár, P., & Girshick, R. B. (2017). Mask R-CNN. CoRR. arXiv:1703.06870

  • Herdt, A., Perrin, N., & Wieber, P. B. (2010). Walking without thinking about it. In International Conference on Intelligent Robots and Systems (IROS). https://doi.org/10.1109/IROS.2010.5654429

  • Hinterstoisser, S., Lepetit, V., Rajkumar, N., & Konolige, K. (2016). Going further with point pair features. In ECCV.

  • Insafutdinov, E., Pishchulin, L., Andres, B., Andriluka, M., & Schiele, B. (2016). Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV.

  • Ionescu, C., Papava, D., Olaru, V., & Sminchisescu, C. (2014). Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), 1325–1339.

  • Jiang, Y., Koppula, H., & Saxena, A. (2013). Hallucinated humans as the hidden context for labeling 3d scenes. In CVPR.

  • Kanazawa, A., Black, M. J., Jacobs, D. W., & Malik, J. (2018). End-to-end recovery of human shape and pose. In CVPR.

  • Kanazawa, A., Zhang, J. Y., Felsen, P., & Malik, J. (2019). Learning 3d human dynamics from video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5614–5623).

  • Kocabas, M., Athanasiou, N., & Black, M. J. (2020). Vibe: Video inference for human body pose and shape estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 5253–5263).

  • Kuffner, J., Nishiwaki, K., Kagami, S., Inaba, M., & Inoue, H. (2005). Motion planning for humanoid robots. In Robotics research. The eleventh international symposium.

  • Li, Y., Wang, G., Ji, X., Xiang, Y., & Fox, D. (2018). DeepIM: Deep iterative matching for 6D pose estimation. In ECCV.

  • Li, Z., Sedlar, J., Carpentier, J., Laptev, I., Mansard, N., & Sivic, J. (2019). Estimating 3d motion and forces of person–object interactions from monocular video. In Computer vision and pattern recognition (CVPR).

  • Lin, T., Maire, M., Belongie, S. J., Bourdev, L. D., Girshick, R. B., Hays, J., et al. (2014). Microsoft COCO: common objects in context. CoRR. arXiv:1405.0312

  • Loing, V., Marlet, R., & Aubry, M. (2018). Virtual training for a real application: Accurate object-robot relative localization without calibration. In IJCV. https://doi.org/10.1007/s11263-018-1102-6

  • Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., & Black, M. J. (2015). SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6), 248.

  • Loper, M. M., Mahmood, N., & Black, M. J. (2014). MoSh: Motion and shape capture from sparse markers. ACM Transactions on Graphics (Proc SIGGRAPH Asia), 33(6), 220:1-220:13. https://doi.org/10.1145/2661229.2661273

  • Maldonado, G. (2018). Some biomechanical and robotic models. https://github.com/GaloMALDONADO/Models

  • Maldonado, G., Bailly, F., Souères, P., & Watier, B. (2017). Angular momentum regulation strategies for highly dynamic landing in Parkour. Computer Methods in Biomechanics and Biomedical Engineering, 20(sup1), 123–124. https://doi.org/10.1080/10255842.2017.1382892, https://hal.archives-ouvertes.fr/hal-01636353

  • Malmaud, J., Huang, J., Rathod, V., Johnston, N., Rabinovich, A., & Murphy, K. (2015). What’s cookin’? Interpreting cooking videos using text, speech and vision. arXiv preprint arXiv:1503.01558

  • Marinoiu, E., Papava, D., & Sminchisescu, C. (2013). Pictorial human spaces: How well do humans perceive a 3d articulated pose? In Proceedings of the IEEE International Conference on Computer Vision (pp. 1289–1296).

  • Martinez, J., Hossain, R., Romero, J., & Little, J. J. (2017). A simple yet effective baseline for 3d human pose estimation. In ICCV.

  • Mordatch, I., Todorov, E., & Popović, Z. (2012). Discovery of complex behaviors through contact-invariant optimization. ACM Transactions on Graphics (TOG), 31(4), 43.

  • Moreno-Noguer, F. (2017). 3d human pose estimation from a single image via distance matrix regression. In CVPR.

  • Newell, A., Yang, K., & Deng, J. (2016). Stacked hourglass networks for human pose estimation. In ECCV.

  • Newell, A., Huang, Z., & Deng, J. (2017). Associative embedding: End-to-end learning for joint detection and grouping. In NIPS.

  • Oberweger, M., Rad, M., & Lepetit, V. (2018). Making deep heatmaps robust to partial occlusions for 3D object pose estimation. In ECCV.

  • Pavlakos, G., Zhou, X., Derpanis, K. G., & Daniilidis, K. (2017). Coarse-to-fine volumetric prediction for single-image 3d human pose. In CVPR.

  • Posa, M., Cantu, C., & Tedrake, R. (2014). A direct method for trajectory optimization of rigid bodies through contact. The International Journal of Robotics Research, 33(1), 69–81.

  • Prest, A., Ferrari, V., & Schmid, C. (2013). Explicit modeling of human–object interactions in realistic videos. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(4), 835–848.

  • Project webpage. (2021). https://www.di.ens.fr/willow/research/motionforcesfromvideo/

  • Rad, M., & Lepetit, V. (2017). Bb8: A scalable, accurate, robust to partial occlusion method for predicting the 3d poses of challenging objects without using depth. In ICCV.

  • Rad, M., Oberweger, M., & Lepetit, V. (2018). Feature mapping for learning fast and accurate 3D pose inference from synthetic images. In CVPR.

  • Rempe, D., Guibas, L. J., Hertzmann, A., Russell, B., Villegas, R., & Yang, J. (2020). Contact and human dynamics from monocular video. In European conference on computer vision (pp. 71–87). Springer.

  • Schultz, G., & Mombaur, K. (2010). Modeling and optimal control of human-like running. IEEE/ASME Transactions on Mechatronics, 15(5), 783–792.

  • Shimada, S., Golyanik, V., Xu, W., & Theobalt, C. (2020). Physcap: Physically plausible monocular 3d motion capture in real time. ACM Transactions on Graphics (TOG), 39(6), 1–16.

  • Sidenbladh, H., Black, M. J., Fleet, D. J. (2000). Stochastic tracking of 3d human figures using 2d image motion. In ECCV.

  • Tassa, Y., Erez, T., & Todorov, E. (2012). Synthesis and stabilization of complex behaviors through online trajectory optimization. In IEEE international conference on intelligent robots and systems (IROS). https://doi.org/10.1109/IROS.2012.6386025

  • Taylor, C. J. (2000). Reconstruction of articulated objects from point correspondences in a single uncalibrated image. Computer Vision and Image Understanding, 80(3), 349–363.

  • Tejani, A., Tang, D., Kouskouridas, R., & Kim, T. K. (2014). Latent-class hough forests for 3d object detection and pose estimation. In ECCV.

  • Tekin, B., Rozantsev, A., Lepetit, V., & Fua, P. (2016). Direct prediction of 3d body poses from motion compensated sequences. In CVPR.

  • Tobin, J., Fong, R., Ray, A., Schneider, J., Zaremba, W., & Abbeel, P. (2017). Domain randomization for transferring deep neural networks from simulation to the real world. CoRR. arXiv:1703.06907

  • Tonneau, S., Del Prete, A., Pettré, J., Park, C., Manocha, D., & Mansard, N. (2018a). An efficient acyclic contact planner for multiped robots. IEEE Transactions on Robotics (TRO). https://doi.org/10.1109/TRO.2018.2819658

  • Tonneau, S., Del Prete, A., Pettré, J., Park, C., Manocha, D., & Mansard, N. (2018b). An efficient acyclic contact planner for multiped robots. IEEE Transactions on Robotics, 34(3), 586–601.

  • Triggs, B., McLauchlan, P. F., Hartley, R. I., Fitzgibbon, A. W. (1999). Bundle adjustment—A modern synthesis. In International workshop on vision algorithms.

  • Wei, X., & Chai, J. (2010). Videomocap: Modeling physically realistic human motion from monocular video sequences. ACM Transactions on Graphics, 29(4), 42:1-42:10. https://doi.org/10.1145/1778765.1778779

  • Westervelt, E. R., Grizzle, J. W., & Koditschek, D. E. (2003). Hybrid zero dynamics of planar biped walkers. IEEE Transactions on Automatic Control, 48(1), 42–56. https://doi.org/10.1109/TAC.2002.806653

  • Winkler, A. W., Bellicoso, C. D., Hutter, M., & Buchli, J. (2018). Gait and trajectory optimization for legged systems through phase-based end-effector parameterization. IEEE Robotics and Automation Letters, 3(3), 1560–1567.

  • Xiang, D., Joo, H., & Sheikh, Y. (2019). Monocular total capture: Posing face, body, and hands in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10965–10974).

  • Xiang, Y., Schmidt, T., Narayanan, V., & Fox, D. (2017). Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes. CoRR. arXiv:1711.00199

  • Yao, B., & Fei-Fei, L. (2012). Recognizing human–object interactions in still images by modeling the mutual context of objects and human poses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1691–1703.

  • Zanfir, A., Marinoiu, E., & Sminchisescu, C. (2018). Monocular 3d pose and shape estimation of multiple people in natural scenes-the importance of multiple scene constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2148–2157).

  • Zhou, X., Zhu, M., Leonardos, S., Derpanis, K. G., & Daniilidis, K. (2016). Sparseness meets deepness: 3d human pose estimation from monocular video. In CVPR.

Acknowledgements

We thank Bruno Watier (Université Paul Sabatier and LAAS-CNRS) and Galo Maldonado (ENSAM ParisTech) for making public the Parkour dataset. This work was partly supported by the ERC grant LEAP (No. 336845), the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, references ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and ANR-19-P3IA-0004 (ANITI 3IA Institute), and the European Regional Development Fund under the project IMPACT (reg. no. CZ.02.1.01/0.0/0.0/15 003/0000468).

Additional information

Communicated by Wenjun Zeng.

Appendices

Outline of the Appendix

This appendix provides additional technical details of the proposed method. Appendix A describes the parametric human and object models used in the trajectory optimization, and Appendix B details the ground contact force generators mentioned in the main paper (Sect. 4.3).

Parametric Human and Object Models

Human model. We model the human body as a multi-body system consisting of a set of rotating joints and the rigid links connecting them. We adopt the joint definition of the SMPL model (Loper et al. 2015) and approximate the human skeleton as a kinematic tree with 24 joints: one free-floating joint and 23 spherical joints. Figure 12 illustrates our human model in a canonical pose. A free-floating joint consists of a 3-dof translation in \({\mathbb {R}}^3\) and a 3-dof rotation in SO(3); we model the pelvis by a free-floating joint to describe the person’s body orientation and translation in the world coordinate frame. A spherical joint is a 3-dof rotation; it represents the relative rotation between two connected links in our model. In practice, we use unit quaternions to represent 3D rotations and axis-angles to describe angular velocities. As a result, the configuration vector of our human model \(q^\mathrm {h}\) is a concatenation of the configuration vectors of the 23 spherical joints (dimension 4 each) and the free-floating pelvis joint (dimension 7), hence of dimension 99. The corresponding human joint velocity \({\dot{q}}^\mathrm {h}\) is of dimension \(23\times 3+6=75\) (obtained by replacing the quaternions with axis-angles). For simplicity, the main paper does not distinguish these two dimensions and treats both \(q^\mathrm {h}\) and \({\dot{q}}^\mathrm {h}\) as represented with axis-angles, hence of the same dimension \(n_q^\mathrm {h}=75\). In addition, based on these 24 joints, we define 18 “virtual markers” (shown as colored spheres in Fig. 12) that represent the 18 OpenPose joints. These markers, rather than the 24 joints, are used to compute the re-projection errors with respect to the OpenPose 2D detections.
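
To make this dimension bookkeeping concrete, the following minimal NumPy sketch lays out the configuration and velocity sizes under the joint counts stated above. It is an illustration only, not the authors’ implementation; the helper names (e.g. human_config_dim, random_human_config) are hypothetical.

```python
import numpy as np

# Illustrative bookkeeping: one free-floating pelvis joint plus 23 spherical
# joints, with unit quaternions in the configuration and axis-angles in the velocity.
N_SPHERICAL = 23

def human_config_dim():
    # pelvis: 3 (translation) + 4 (unit quaternion); each spherical joint: 4
    return (3 + 4) + 4 * N_SPHERICAL   # = 99

def human_velocity_dim():
    # pelvis: 3 (linear) + 3 (angular); each spherical joint: 3 (axis-angle rate)
    return 6 + 3 * N_SPHERICAL         # = 75

def random_human_config(seed=0):
    """Draw a random configuration q^h with normalized quaternions (illustrative)."""
    q = np.random.default_rng(seed).normal(size=human_config_dim())
    q[3:7] /= np.linalg.norm(q[3:7])   # pelvis orientation quaternion
    for j in range(N_SPHERICAL):
        s = 7 + 4 * j                  # start index of joint j's quaternion
        q[s:s + 4] /= np.linalg.norm(q[s:s + 4])
    return q

assert human_config_dim() == 99 and human_velocity_dim() == 75
```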

Object models. All four objects, namely barbell, hammer, scythe and spade, are each modeled as a rigid, non-deformable stick. The configuration \(q^\mathrm {o}\) represents the 6-dof displacement of the stick handle, as illustrated in Fig. 13. In practice, \(q^\mathrm {o}\) is a 7-dimensional vector containing the 3D translation and the 4D quaternion rotation of the free-floating handle end. The object joint velocity \({\dot{q}}^\mathrm {o}\) is of dimension 6 (obtained by replacing the quaternion with an axis-angle). For the handtools we model, the contact area is the stick handle; we ignore the handle’s thickness and represent the contact area by the line segment between the two endpoints of the handle. Depending on the number of human joints in contact with the object, we attach the same number of contact points to the object’s local coordinate frame. These contact points can be located anywhere along the feasible contact area. In practice, all object contact points, together with the endpoint corresponding to the head of the handtool, are implemented as “virtual” prismatic joints of dimension 1.
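
As an illustration of this stick parameterization, the sketch below maps the 7-dimensional object configuration and a prismatic contact coordinate to a world-space contact point. This is a hedged example rather than the paper’s code: the (x, y, z, w) quaternion convention, the choice of the stick’s local x-axis as the handle direction, and the helper names are assumptions.

```python
import numpy as np

def quat_to_rotmat(q):
    """Unit quaternion (x, y, z, w) -> 3x3 rotation matrix (standard formula)."""
    x, y, z, w = q
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - z*w),     2*(x*z + y*w)],
        [2*(x*y + z*w),     1 - 2*(x*x + z*z), 2*(y*z - x*w)],
        [2*(x*z - y*w),     2*(y*z + x*w),     1 - 2*(x*x + y*y)],
    ])

def contact_point_world(q_obj, s, handle_length):
    """World position of a contact point located s meters from the handle end.

    q_obj = (3D translation of the handle end, unit quaternion); s is the 1-dof
    "prismatic" coordinate along the stick, clipped to the feasible contact area.
    Using the local x-axis as the handle direction is an assumed convention.
    """
    t, quat = q_obj[:3], q_obj[3:7]
    s = np.clip(s, 0.0, handle_length)
    return t + quat_to_rotmat(quat) @ np.array([s, 0.0, 0.0])

# Example: handle end at the origin, identity rotation, contact 0.4 m along a 1.2 m handle.
q_obj = np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0])
print(contact_point_world(q_obj, 0.4, 1.2))    # -> [0.4, 0.0, 0.0]
```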

Fig. 12

Our human model in the reference posture. The skeleton consists of one free-floating basis joint corresponding to the pelvis, and 23 spherical joints. The colored spheres are 18 virtual markers that correspond to the 18 OpenPose joints. Each marker is associated with a semantic joint in our model

Fig. 13

All four handtools are represented by a single object model shown in this image. The object model consists of 1 free-floating basis joint corresponding to the handle end point (red sphere), 1 prismatic joint corresponding to the head of the tool (green sphere), and several prismatic joints corresponding to the location of the contact points (grey translucent spheres in the middle). The contact points should lie on the feasible contact area (grey stick) formed by the two endpoints

Generators of the Ground Contact Forces

In this section, we describe the generators \(g^{(3)}_n\) and \(g^{(6)}_{kn}\) for computing the contact forces exerted by the ground on the person. Recall from the main paper that we consider different contact models depending on the type of the joint. We model the planar contacts between the human sole and the ground by fitting the point contact model [given by Eq. (9) in the main paper] at each of the four sole vertices. For other types of ground contacts, e.g. the knee-ground contact, we apply the point contact model directly at the human joint. We model the ground as a 2D plane \(G = \{p\in {\mathbb {R}}^3|a^Tp=b\}\) with a normal vector \(a\in {\mathbb {R}}^3\), \(a\ne 0\), \(b\in {\mathbb {R}}\) and a friction coefficient \(\mu \). In the following, we first provide the expression of the 3D generators \(g^{(3)}_n\) for modeling point contact forces and then derive the 6D generators \(g^{(6)}_{kn}\) for modeling planar contact forces.
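
For concreteness, a ground plane in this form can be stored as the pair (a, b) together with the friction coefficient. The small sketch below is purely illustrative; the normal direction, tolerance and numerical values are assumptions, not quantities from the paper.

```python
import numpy as np

# Illustrative ground-plane container: G = {p | a^T p = b}, plus friction mu.
a = np.array([0.0, 1.0, 0.0])    # assumed unit normal, pointing away from gravity
b = 0.0                          # plane offset
mu = 0.7                         # friction coefficient (placeholder value)

def on_ground(p, tol=1e-2):
    """True if a 3D point p lies on the ground plane, up to a tolerance in meters."""
    return abs(a @ p - b) <= tol

print(on_ground(np.array([0.3, 0.005, -0.1])))   # -> True
```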

3D generators \(g^{(3)}_n\) for point contact forces. Let \(p_k\) be the position of a contact point k located on the ground surface, i.e. \(a^Tp_k=b\). At contact point k, we define a right-handed coordinate frame C whose xz-plane overlaps the plane G and whose y-axis points in the gravity direction, i.e., opposite to the ground normal a. During point contact, it is a common assumption that the ground exerts only a linear reaction force at the contact point. In other words, the spatial contact force expressed in the local frame C takes the form

$$\begin{aligned} ^C\phi = \begin{pmatrix} f \\ {\mathbf {0}}_{3\times 1} \end{pmatrix}, \end{aligned}$$
(17)

where the linear component f must lie in the second-order cone \({\mathcal {K}}^3 = \{f=(f_x,f_y,f_z)^T|\sqrt{f_x^2 + f_z^2} \le -f_y \tan \mu \}\), which can be approximated by the pyramid \({{\mathcal {K}}^3}^\prime = \{f=\sum _{n=1}^4{\lambda _n g^{(3)}_n}|\lambda _n\ge 0\}\), with a set of 3D-generators

$$\begin{aligned} g^{(3)}_1&= \left( \sin {\mu }, -\cos {\mu }, 0\right) ^T, \end{aligned}$$
(18)
$$\begin{aligned} g^{(3)}_2&= \left( -\sin {\mu }, -\cos {\mu }, 0\right) ^T, \end{aligned}$$
(19)
$$\begin{aligned} g^{(3)}_3&= \left( 0, -\cos {\mu }, \sin {\mu }\right) ^T, \end{aligned}$$
(20)
$$\begin{aligned} g^{(3)}_4&= \left( 0, -\cos {\mu }, -\sin {\mu }\right) ^T, \end{aligned}$$
(21)

where \(\mu \) is the friction coefficient. More formally, we are approximating the friction cone \({\mathcal {K}}^3\) with the conic hull \({{\mathcal {K}}^3}^\prime \) spanned by 4 points on the boundary of \({\mathcal {K}}^3\), namely, \(g^{(3)}_n\) with \(n=1,2,3,4\).
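
This pyramid approximation can be sanity-checked numerically. The sketch below (NumPy; not part of the paper) builds the four generators of Eqs. (18)–(21) and verifies that a nonnegative combination satisfies the cone inequality \(\sqrt{f_x^2 + f_z^2} \le -f_y \tan \mu \) stated above, with \(\mu \) entering through sin/cos exactly as in the equations; the value of \(\mu \) used here is a placeholder.

```python
import numpy as np

def point_contact_generators(mu):
    """The four 3D generators g^(3)_1..g^(3)_4 of Eqs. (18)-(21)."""
    s, c = np.sin(mu), np.cos(mu)
    return np.array([[ s, -c, 0.0],
                     [-s, -c, 0.0],
                     [0.0, -c,  s],
                     [0.0, -c, -s]])

mu = 0.7                                                    # placeholder value
G = point_contact_generators(mu)
lam = np.random.default_rng(0).uniform(0.0, 1.0, size=4)    # lambda_n >= 0
f = lam @ G                                                 # f = sum_n lambda_n g^(3)_n
fx, fy, fz = f
# The nonnegative combination stays inside the friction cone used in the text.
assert np.hypot(fx, fz) <= -fy * np.tan(mu) + 1e-9
```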

6D generators \(g^{(6)}_{kn}\) for planar (sole) contact forces. Here we show how to obtain the 6D generator \(g^{(6)}_{kn}\) from \(g^{(3)}_{n}\) and the contact point position \(p_k\). As described in the main paper, we approximate the human sole as a rectangular area with 4 contact points. We assume that the sole overlaps the ground plane G during contact. Similarly to the point contact case, we define 5 parallel coordinate frames: a frame \(C_k\) at each of the four sole contact points, plus a frame A at the ankle joint. Note that the frames \(C_k\) and A are parallel to each other, i.e., there is no rotation but only translation when passing from one frame to another. We can write the contact force at contact point k as the 6D spatial force

$$\begin{aligned} ^{C_k}\phi _k = \sum _{n=1}^4 \lambda _{kn} \begin{pmatrix} g^{(3)}_n \\ {\mathbf {0}}_{3\times 1} \end{pmatrix} , \text { with } \lambda _{kn} \ge 0. \end{aligned}$$
(22)

We denote by \(^Ap_k\) the position of contact point \(c_k\) in the ankle frame A, and by \(^AX_{C_k}^*\) the matrix converting spatial forces from frame \(C_k\) to frame A. We can then express the total sole contact force in frame A:

$$\begin{aligned} ^A\phi&= \sum _{k=1}^4{{^AX_{C_k}^*}^{C_k}\phi _k} \end{aligned}$$
(23)
$$\begin{aligned}&= \sum _{k=1}^4{\begin{pmatrix} I_3 &{} ^Ap_k\times \\ 0_3 &{} I_3\\ \end{pmatrix}^{-T}{^{C_k}\phi _k}} \end{aligned}$$
(24)
$$\begin{aligned}&=\sum _{k=1}^4\sum _{n=1}^4 \lambda _{kn} g^{(6)}_{kn}, \end{aligned}$$
(25)

where

$$\begin{aligned} g^{(6)}_{kn}= \begin{pmatrix} g^{(3)}_n \\ ^Ap_k\times g^{(3)}_n \end{pmatrix}. \end{aligned}$$
(26)
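
A compact numerical illustration of Eq. (26): each of the four sole vertices contributes four 6D generators whose angular part is the moment \(^Ap_k\times g^{(3)}_n\) about the ankle. The sketch below is self-contained but hypothetical; the sole dimensions and frame axes are assumed placeholders, not values from the paper.

```python
import numpy as np

def sole_contact_generators(mu, sole_points_in_ankle_frame):
    """Stack the 16 generators g^(6)_{kn} of Eq. (26) for a 4-vertex sole."""
    s, c = np.sin(mu), np.cos(mu)
    G3 = np.array([[ s, -c, 0.0], [-s, -c, 0.0],   # g^(3)_1..g^(3)_4, Eqs. (18)-(21)
                   [0.0, -c,  s], [0.0, -c, -s]])
    gens = []
    for p_k in sole_points_in_ankle_frame:          # four sole vertices ^A p_k
        for g in G3:
            gens.append(np.concatenate([g, np.cross(p_k, g)]))   # (linear; angular)
    return np.array(gens)                           # shape (16, 6)

# Assumed example: a 20 cm x 10 cm sole, 5 cm below the ankle
# (y points towards gravity, as in the local frames defined above).
sole = np.array([[ 0.10, 0.05,  0.05], [ 0.10, 0.05, -0.05],
                 [-0.10, 0.05,  0.05], [-0.10, 0.05, -0.05]])
print(sole_contact_generators(0.7, sole).shape)     # -> (16, 6)
```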

Cite this article

Li, Z., Sedlar, J., Carpentier, J. et al. Estimating 3D Motion and Forces of Human–Object Interactions from Internet Videos. Int J Comput Vis 130, 363–383 (2022). https://doi.org/10.1007/s11263-021-01540-1
