This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Zhouqiang Jiang, Meetyou AI Lab,
(2) Bowen Wang, Institute for Datability Science, Osaka University,
(3) Tong Xiang, Meetyou AI Lab,
(4) Zhaofeng Niu, Department of Computer Science, Qufu Normal University,
(5) Hong Tang, Department of Information Engineering, East China Jiaotong University,
(6) Guangshun Li, Department of Computer Science, Qufu Normal University,
(7) Liangzhi Li, Meetyou AI Lab.
V. CONCLUSION
In this paper, we proposed CatMAE for self-supervised video representation learning. It leverages a concatenated information channel masking strategy that addresses the limitations of cube masking and, compared with asymmetric masking, better captures continuous and long-term motion. Our experimental results demonstrate superior performance over state-of-the-art methods on both video segmentation and action recognition tasks. A distinctive feature of our training pipeline is that reconstruction information from the initial frame propagates through the entire video sequence. This theoretically unbounded propagation highlights CatMAE’s potential to learn long-term video representations. Future work will focus on extending CatMAE to real-world scenarios involving embodied agents, such as robots.
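To make the propagation idea concrete, the following is a minimal, hypothetical sketch in PyTorch of how a reconstructed frame could be fed forward as conditioning for the next frame. The function `propagate_reconstruction`, the `ToyDecoder` module, the 0.95 masking ratio, and the patch layout are all illustrative assumptions for exposition, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ToyDecoder(nn.Module):
    """Stand-in decoder: pools the latent tokens and predicts a full frame of
    patch embeddings. A real masked-autoencoder decoder would use mask tokens
    and attention; this placeholder only keeps the shapes consistent."""

    def __init__(self, dim, num_patches):
        super().__init__()
        self.dim = dim
        self.num_patches = num_patches
        self.proj = nn.Linear(dim, num_patches * dim)

    def forward(self, latent):                 # latent: (M, D)
        pooled = latent.mean(dim=0)            # (D,)
        return self.proj(pooled).view(self.num_patches, self.dim)


def propagate_reconstruction(frames, encoder, decoder, mask_ratio=0.95):
    """Hypothetical sketch (not the authors' code): every frame after the
    first is reconstructed from a small set of its own visible patches
    together with the previously reconstructed frame, so information from
    the initial frame can, in principle, propagate through the whole clip."""
    prev = frames[0]                           # frame 0 is kept fully visible
    outputs = [prev]
    for t in range(1, frames.shape[0]):
        num_patches = frames[t].shape[0]
        num_visible = max(1, int(num_patches * (1.0 - mask_ratio)))
        visible_idx = torch.randperm(num_patches)[:num_visible]
        # Concatenate the previous (possibly reconstructed) frame with the
        # few visible patches of the current frame as the encoder input.
        latent = encoder(torch.cat([prev, frames[t][visible_idx]], dim=0))
        recon = decoder(latent)                # (N, D): reconstructed frame t
        outputs.append(recon)
        prev = recon                           # feed the reconstruction forward
    return torch.stack(outputs)                # (T, N, D)


if __name__ == "__main__":
    T, N, D = 4, 196, 384                      # toy clip: 4 frames of 196 patches
    frames = torch.randn(T, N, D)
    encoder = nn.Linear(D, D)                  # placeholder for a ViT encoder
    decoder = ToyDecoder(D, N)
    print(propagate_reconstruction(frames, encoder, decoder).shape)
```

Because each reconstruction becomes the conditioning input for the next step, the chain has no fixed horizon, which is the sense in which the propagation is theoretically unlimited.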