Concatenated Masked Autoencoders as Spatial-Temporal Learner: Related Work

27 Feb 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Zhouqiang Jiang, Meetyou AI Lab,

(2) Bowen Wang, Institute for Datability Science, Osaka University,

(3) Tong Xiang, Meetyou AI Lab,

(4) Zhaofeng Niu, Department of Computer Science, Qufu Normal University,

(5) Hong Tang, Department of Information Engineering, East China Jiaotong University,

(6) Guangshun Li, Department of Computer Science, Qufu Normal University,

(7) Liangzhi Li, Meetyou AI Lab.

A. Masked Visual Modeling

Masked visual modeling methods [14] learn representations from images corrupted by masking. Essentially, they function as denoising autoencoders: the input signal is corrupted, and the original, undamaged signal is reconstructed to learn effective representations. This paradigm has spawned a range of derivatives, such as reconstructing masked pixels [15], [16] or restoring lost color channels [17].
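As a concrete illustration of this denoising-autoencoder view, the following minimal PyTorch sketch (our own illustration, not code from the paper; `model` stands for any hypothetical image-to-image network) corrupts random pixels and scores the reconstruction only at the corrupted positions:

```python
import torch

def masked_reconstruction_loss(model, images, mask_ratio=0.6):
    """Corrupt inputs by zeroing random pixels, then reconstruct the originals.

    `model` is a placeholder image-to-image network (an assumption for this
    sketch); the loss is computed only on the corrupted (masked) locations,
    as in denoising autoencoders.
    """
    mask = (torch.rand_like(images) < mask_ratio).float()  # 1 = masked pixel
    corrupted = images * (1.0 - mask)                      # drop masked pixels
    reconstruction = model(corrupted)
    # Mean squared error restricted to the masked positions.
    return ((reconstruction - images) ** 2 * mask).sum() / mask.sum().clamp(min=1.0)
```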

The success of masked language modeling [18] in self-supervised pre-training for NLP, together with the popularity of ViT [5], has sparked extensive research into transformer-based architectures for masked visual modeling in computer vision [19], [1], [20], [21], [22], [23]. BEiT [19], PeCo [20], and iBOT [23] naturally inherit the idea of BERT [18] and propose learning representations from images by predicting discrete tokens.

Some other research [24], [1], [22] focuses on using pixels as prediction targets, which is simpler. MAE [1] and SimMIM [22] learn good visual representations by reconstructing missing patches from randomly masked input image patches. Moreover, MAE's encoder processes only the visible patches under a high mask ratio, which significantly speeds up training and yields better transfer performance. MAE has also been straightforwardly extended to the video domain [2], [3] by reconstructing masked space-time cubes. However, cube masking is sub-optimal for learning correspondences. SiamMAE [4] proposed an asymmetric masking strategy that keeps past frames intact while reconstructing heavily masked future frames, encouraging the model to capture motion and match correspondences across frames. However, this strategy struggles to model long-term continuous motion. Therefore, we propose a concatenated information channel masking reconstruction strategy that extends motion modeling to a theoretically infinite frame interval.
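For intuition, here is a minimal sketch of MAE-style random patch masking, where only the small visible subset of patch tokens is passed to the encoder; this is our own illustration under stated assumptions, not the paper's implementation. An asymmetric setup in the spirit of SiamMAE would simply apply a near-zero mask ratio to past frames and a very high ratio to future frames before encoding.

```python
import torch

def random_patch_masking(patches, mask_ratio=0.75):
    """MAE-style random masking: keep a small subset of patch tokens.

    `patches` has shape (batch, num_patches, dim). Returns the visible patches,
    the indices kept, and a binary mask over all patches (1 = masked). Only the
    visible subset is fed to the encoder, which is what makes training with a
    high mask ratio cheap. Function name and shapes are illustrative assumptions.
    """
    b, n, d = patches.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))

    noise = torch.rand(b, n)                       # random score per patch
    ids_shuffle = noise.argsort(dim=1)             # ascending: lowest scores are kept
    ids_keep = ids_shuffle[:, :num_keep]

    visible = torch.gather(
        patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d)
    )

    mask = torch.ones(b, n)
    mask.scatter_(1, ids_keep, 0.0)                # 0 = visible, 1 = masked
    return visible, ids_keep, mask
```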

B. Contrastive-Based Self-Supervision

Effectively utilizing the temporal dimension is crucial in self-supervised video representation learning. A variety of pretext tasks have been used for pre-training, including predicting the future [25], [26], [27], segmenting pseudo ground truth [28], reconstructing future frames [10], [11], [29], tracking [30], [31], reference coloring [32], and temporal ordering [33], [34], [35], [36]. More advanced contrastive learning methods [37], [38] learn representations by modeling image similarities and dissimilarities [39], [40] or similarities alone [41], [42], [43]. However, these methods rely on large batches [40], multi-crop augmentation [44], negative key queues [39], or custom strategies to prevent representation collapse [42], and their performance depends heavily on the choice of image augmentations [40]. In contrast, our method is based on a simple masking and reconstruction pipeline [1].
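For contrast with the masking-and-reconstruction pipeline, the sketch below shows a generic InfoNCE-style objective of the kind such contrastive methods build on (our illustrative code, not taken from any of the cited works); the other rows of the batch serve as negatives, which is one reason these methods benefit from large batches or external negative queues.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(queries, keys, temperature=0.07):
    """InfoNCE-style contrastive loss over a batch of paired views.

    `queries` and `keys` are (batch, dim) embeddings of two augmented views of
    the same images; matching rows are positives, all other rows in the batch
    act as negatives. Names and the temperature value are illustrative choices.
    """
    queries = F.normalize(queries, dim=1)
    keys = F.normalize(keys, dim=1)
    logits = queries @ keys.t() / temperature           # (batch, batch) similarities
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)             # positives lie on the diagonal
```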