Abstract
In this paper, we propose a self-attention-based multi-modal LSTM framework for the challenging task of action anticipation in first-person videos. Our framework considers three complementary video features: RGB frames for spatial information, optical flow fields for temporal information, and object-based features that identify which object the camera wearer interacts with. Unlike previous works that directly use features taken after the convolutional layers, we encode the multi-modal features with a self-attention mechanism, exploiting the structural similarity between text sequences and video sequences. A sinusoidal positional vector is added to each frame encoding so that the self-attention module can learn the order of the sequence. Multi-modal LSTMs then accumulate the historical information of the video and generate predictions at different anticipation times. We evaluate the proposed method on two benchmark datasets; our framework outperforms state-of-the-art approaches on standard metrics and alleviates the problem of poor long-term prediction.
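The sinusoidal positional vector mentioned above follows the standard trigonometric form of Vaswani et al.; below is a minimal sketch of how such an encoding can be computed and added to a sequence of frame features before self-attention. The function name, sequence length, and feature dimensionality are illustrative choices, not taken from the paper:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    # Each pair of dimensions shares one frequency: 10000^(-2i/d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

# Toy usage: inject order information into 16 frame features of dimension 64,
# so a (position-agnostic) self-attention module can exploit frame order.
frame_feats = np.random.randn(16, 64)
encoded = frame_feats + sinusoidal_positions(16, 64)
```

Because the encoding is a fixed function of position rather than a learned embedding, it can be applied to video clips of any length without retraining.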
ACKNOWLEDGMENTS
This work is supported by the Local College Capacity Building Project of the Shanghai Municipal Science and Technology Commission (project no. 20020500700) and by the National Natural Science Foundation of China (project no. 61802250).
COMPLIANCE WITH ETHICAL STANDARDS
This article is a completely original work of its authors; it has not been published before and will not be sent to other publications until the PRIA Editorial Board decides not to accept it for publication.
Conflict of Interest
The authors declare that they have no conflicts of interest.
Additional information
Jie Shao received her B.S. and M.S. degrees from the Nanjing University of Aeronautics and Astronautics and her PhD from Tongji University. She is currently an associate professor at Shanghai University of Electric Power. Her research interests include computer vision, video surveillance, and human emotion analysis.
Chen Mo received his Bachelor's degree in electrical engineering and automation from Heilongjiang University of Science and Technology in 2018. He is currently a graduate student in the Department of Electronics and Information Engineering at Shanghai University of Electric Power. His research interests include video understanding and action anticipation.
Cite this article
J. Shao and C. Mo, "Action anticipation in first-person videos with self-attention based multi-modal network," Pattern Recognit. Image Anal. 32, 429–435 (2022). https://doi.org/10.1134/S1054661822020183