
Action Anticipation in First-Person Videos with Self-Attention Based Multi-Modal Network

  • APPLICATION PROBLEMS

Pattern Recognition and Image Analysis

Abstract

In this paper, we propose a self-attention based multi-modal LSTM framework for the challenging task of action anticipation in first-person videos. Our framework considers three video modalities: RGB images for spatial information, optical flow fields for temporal information, and object-based features that indicate which object the camera wearer interacts with. Unlike previous works that directly use features taken from convolutional layers, we encode the multi-modal features with a self-attention mechanism, motivated by the structural similarity between text sequences and video sequences. A positional vector based on trigonometric functions is added to encode the position of each frame so that the self-attention module can learn the positional information of the sequence. Multi-modal LSTMs then accumulate the historical information of the video and generate predictions at different anticipation times. The proposed method is evaluated on two benchmark datasets; the results show that our framework outperforms state-of-the-art approaches on the standard metrics and alleviates the problem of poor long-term prediction.
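The abstract names three ingredients: trigonometric (sinusoidal) positional encoding, self-attention applied to the per-frame features of each modality, and LSTMs that carry the observed history forward to a prediction. The sketch below shows how these pieces can be wired together in PyTorch. It is not the authors' implementation: the feature dimensions, the 512-unit model width, the number of action classes, and the late fusion by summing branch scores are all illustrative assumptions.

```python
# Minimal sketch of the components described in the abstract (not the authors' code).
import math
import torch
import torch.nn as nn


def sinusoidal_positions(seq_len: int, dim: int) -> torch.Tensor:
    """Trigonometric positional vectors, as in the Transformer (Vaswani et al., 2017)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)      # (T, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                      # (dim/2,)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)   # even indices: sine
    pe[:, 1::2] = torch.cos(pos * div)   # odd indices: cosine
    return pe                                                          # (T, dim)


class ModalityBranch(nn.Module):
    """One branch (RGB, optical flow, or object features): self-attention + LSTM."""

    def __init__(self, feat_dim: int, model_dim: int = 512,
                 heads: int = 8, num_actions: int = 2513):
        super().__init__()
        self.model_dim = model_dim
        self.proj = nn.Linear(feat_dim, model_dim)
        self.attn = nn.MultiheadAttention(model_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(model_dim)
        self.lstm = nn.LSTM(model_dim, model_dim, batch_first=True)
        self.head = nn.Linear(model_dim, num_actions)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, T, feat_dim) frame-level features of one modality
        x = self.proj(feats)
        x = x + sinusoidal_positions(x.size(1), self.model_dim).to(x.device)
        attn_out, _ = self.attn(x, x, x)     # self-attention across the T observed frames
        x = self.norm(x + attn_out)          # residual connection + layer normalization
        hidden, _ = self.lstm(x)             # accumulate the observed history
        return self.head(hidden[:, -1])      # action scores for one anticipation step


if __name__ == "__main__":
    # Illustrative feature sizes for the RGB, flow, and object branches (assumptions).
    rgb, flow, obj = ModalityBranch(1024), ModalityBranch(1024), ModalityBranch(352)
    frames = 14                              # observed frames before the action
    scores = (rgb(torch.randn(2, frames, 1024))
              + flow(torch.randn(2, frames, 1024))
              + obj(torch.randn(2, frames, 352)))   # fuse branches by summing scores
    print(scores.shape)                      # torch.Size([2, 2513])
```

To produce predictions at several anticipation times, the LSTM can be stepped frame by frame and a score read out at each step; the exact unrolling schedule is not specified in the abstract, so the single-step readout above is only a placeholder.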



ACKNOWLEDGMENTS

This work was supported by the Local College Capacity Building Project of the Shanghai Municipal Science and Technology Commission (project no. 20020500700) and by the National Natural Science Foundation of China (project no. 61802250).

Author information

Corresponding authors

Correspondence to Jie Shao or Chen Mo.

Ethics declarations

COMPLIANCE WITH ETHICAL STANDARDS

This article is a completely original work of its authors; it has not been published before and will not be sent to other publications until the PRIA Editorial Board decides not to accept it for publication.

Conflict of Interest

The authors declare that they have no conflicts of interest.

Additional information

Jie Shao received her B.S. and M.S. degrees from the Nanjing University of Aeronautics and Astronautics and her Ph.D. from Tongji University. She is currently an associate professor at Shanghai University of Electric Power. Her research interests include computer vision, video surveillance, and human emotion analysis.

Chen Mo received his Bachelor's degree in electrical engineering and automation from Heilongjiang University of Science and Technology in 2018. He is currently a graduate student in the Department of Electronics and Information Engineering at Shanghai University of Electric Power. His research interests include video understanding and action anticipation.

About this article

Cite this article

Jie Shao and Chen Mo, "Action Anticipation in First-Person Videos with Self-Attention Based Multi-Modal Network," Pattern Recognit. Image Anal. 32, 429–435 (2022). https://doi.org/10.1134/S1054661822020183
