Abstract
In this paper, we propose a self-attention-based multi-modal LSTM framework for the challenging task of action anticipation in first-person videos. Our framework considers three complementary video features: RGB frames for spatial information, optical flow fields for temporal information, and object-based features that identify which object the camera wearer interacts with. Unlike previous works that directly use features taken after the convolutional layers, we encode the multi-modal features with a self-attention mechanism, exploiting the structural similarity between text sequences and video sequences. A sinusoidal positional vector is added to each frame encoding so that the self-attention module can learn the order of the sequence. Multi-modal LSTMs then accumulate the historical information of the video and generate predictions at different anticipation times. We evaluate the proposed method on two benchmark datasets; our framework outperforms state-of-the-art approaches on standard metrics and alleviates the problem of poor long-term prediction.
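The sinusoidal positional vector mentioned above follows the standard trigonometric form of Vaswani et al.; below is a minimal sketch of how such an encoding can be computed and added to a sequence of frame features before self-attention. The function name, sequence length, and feature dimensionality are illustrative choices, not taken from the paper:

```python
import numpy as np

def sinusoidal_positions(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(d_model)[None, :]               # (1, d_model)
    # Each pair of dimensions shares one frequency: 10000^(-2i/d_model).
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])            # odd dimensions: cosine
    return pe

# Toy usage: inject order information into 16 frame features of dimension 64,
# so a (position-agnostic) self-attention module can exploit frame order.
frame_feats = np.random.randn(16, 64)
encoded = frame_feats + sinusoidal_positions(16, 64)
```

Because the encoding is a fixed function of position rather than a learned embedding, it can be applied to video clips of any length without retraining.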
ACKNOWLEDGMENTS
This work is supported by the Local College Capacity Building Project of the Shanghai Municipal Science and Technology Commission (project no. 20020500700) and by the National Natural Science Foundation of China (project no. 61802250).
COMPLIANCE WITH ETHICAL STANDARDS
This article is a completely original work of its authors; it has not been published before and will not be sent to other publications until the PRIA Editorial Board decides not to accept it for publication.
Conflict of Interest
The authors declare that they have no conflicts of interest.
Additional information
Jie Shao received her B.S. and M.S. degrees from the Nanjing University of Aeronautics and Astronautics and her PhD from Tongji University. She is currently an associate professor at Shanghai University of Electric Power. Her research interests include computer vision, video surveillance, and human emotion analysis.
Chen Mo received his Bachelor's degree in electrical engineering and automation from Heilongjiang University of Science and Technology in 2018. He is currently a graduate student in the Department of Electronics and Information Engineering at Shanghai University of Electric Power. His research interests include video understanding and action anticipation.
Cite this article
J. Shao and C. Mo, "Action anticipation in first-person videos with self-attention based multi-modal network," Pattern Recognit. Image Anal. 32, 429–435 (2022). https://doi.org/10.1134/S1054661822020183