Concatenated Masked Autoencoders as Spatial-Temporal Learner: Method

27 Feb 2024

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Zhouqiang Jiang, Meetyou AI Lab,

(2) Bowen Wang, Institute for Datability Science, Osaka University,

(3) Tong Xiang, Meetyou AI Lab,

(4) Zhaofeng Niu, Department of Computer Science, Qufu Normal University,

(5) Hong Tang, Department of Information Engineering, East China Jiaotong University,

(6) Guangshun Li, Department of Computer Science, Qufu Normal University,

(7) Liangzhi Li, Meetyou AI Lab.

III. METHOD

In this section, we introduce each key component of CatMAE. The overall model architecture is shown in Fig. 2.

Patch Embedding. First, we chronologically select N frames from a video clip as the input sequence. The interval between frames is randomly chosen from a pre-determined frame-gap range. Then, following the process of ViT [5], we divide each frame into non-overlapping patches. We flatten the patches and apply a linear projection, concatenate a [CLS] token to the patch tokens, and finally add position embeddings.
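The following is a minimal PyTorch sketch of this step. The patch size, embedding dimension, and frame-gap range are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class FramePatchEmbed(nn.Module):
    """ViT-style patch embedding for a single frame (hyperparameters are assumptions)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # Linear projection of flattened patches, implemented as a strided convolution.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, frame):                                 # frame: (B, 3, H, W)
        x = self.proj(frame).flatten(2).transpose(1, 2)       # (B, num_patches, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)                        # prepend [CLS]
        return x + self.pos_embed                             # add spatial position embedding


def sample_frames(video, num_frames, gap_range=(4, 16)):
    """Chronologically sample N frames with a random inter-frame gap.
    `video`: (T, 3, H, W); `gap_range` is an illustrative assumption."""
    gap = int(torch.randint(gap_range[0], gap_range[1] + 1, (1,)))
    start = int(torch.randint(0, max(video.shape[0] - gap * (num_frames - 1), 1), (1,)))
    idx = torch.arange(num_frames) * gap + start
    return video[idx]
```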

Concatenated Information Channel Masking. We keep the first frame fully visible and apply a high masking ratio to the N−1 subsequent frames, preserving only a minimal number of visible patches. Given that video signals are highly redundant [8], the high masking ratio prevents the model from relying on information copied from adjacent frames during reconstruction, thus encouraging the encoder to capture motion information and correspondence.

Fig. 2: Pipeline of our CatMAE. During pre-training, we chronologically extract N frames from a video clip, keep all patches of the first frame visible, and randomly mask the patches of the subsequent N−1 frames with a very high masking ratio. The visible patches of the N frames are independently processed by the ViT [5] encoder. The decoder reconstructs the masked patches of the subsequent frames.

The visible patches in subsequent frames provide information channels, theoretically allowing the frame sequence to be reconstructed indefinitely. To reconstruct the final frame, the model must reconstruct the continuous motion and correspondences of the intermediate frames, ultimately modeling the evolution between frames (as shown in Fig. 1).
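A small sketch of this masking scheme is shown below: frame 0 keeps all of its patches, while each later frame keeps only a random subset. The 0.95 ratio and the helper name are assumptions for illustration.

```python
import torch

def concat_channel_mask(num_frames, num_patches, mask_ratio=0.95, device="cpu"):
    """Return per-frame visible patch indices: frame 0 is fully visible,
    the remaining N-1 frames keep only a small random subset of patches."""
    visible = [torch.arange(num_patches, device=device)]          # all patches of frame 0
    num_keep = max(1, int(num_patches * (1 - mask_ratio)))
    for _ in range(num_frames - 1):
        perm = torch.randperm(num_patches, device=device)
        visible.append(perm[:num_keep])                           # visible indices of this frame
    return visible

# Example: 4 frames of 196 patches each -> frame 0 keeps 196 patches, frames 1-3 keep ~9.
visible_ids = concat_channel_mask(num_frames=4, num_patches=196)
```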

Encoder. We employ a weight-shared vanilla ViT [5] to independently process all frames, with the encoder encoding only the visible patches of each frame. This design significantly reduces time and memory complexity while achieving better performance [1]. Furthermore, the patch embedding adds only spatial position embeddings; hence, the encoder is unaware of the temporal structure of the patches. However, the decoder's cross-attention layers need to use the encoder's output to propagate the visible content. Even though the temporal structure is unknown, the encoder is therefore forced to match the correspondence between spatial-temporal patches before and after motion. This correspondence assists the decoder in using the information channels to propagate visible content, ultimately enabling the reconstruction of subsequent frames.
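The sketch below illustrates the weight-sharing: a single Transformer encoder is applied to the visible tokens of each frame in turn. The depth, head count, and use of torch's built-in encoder layers are assumptions standing in for the vanilla ViT blocks.

```python
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    """One weight-shared Transformer encoder applied independently per frame (sketch)."""
    def __init__(self, embed_dim=768, depth=12, num_heads=12):
        super().__init__()
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads,
                                           dim_feedforward=4 * embed_dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, frames_tokens, visible_ids):
        # frames_tokens: list of (B, 1 + num_patches, D) embedded frames
        # visible_ids:   list of visible patch index tensors per frame ([CLS] excluded)
        encoded = []
        for tokens, ids in zip(frames_tokens, visible_ids):
            vis = torch.cat([tokens[:, :1], tokens[:, 1:][:, ids]], dim=1)  # [CLS] + visible patches
            encoded.append(self.norm(self.blocks(vis)))  # same weights for every frame
        return encoded
```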

Decoder. We also employ a weight-shared decoder to separately predict the masked patches of each reconstructed frame. Each decoder block consists of a cross-attention layer and a self-attention layer [45]. Specifically, the visible tokens of the frame to be reconstructed are projected via a linear layer and combined with mask tokens to form a set of full tokens, to which spatial position embeddings are then added. The full tokens then attend to all previously visible tokens (the tokens of the first frame and the visible tokens of the other frames) through the cross-attention layer, followed by mutual attention through the self-attention layer. Our concatenated information channel masking enables the decoder to enhance the encoder's ability to estimate motion offsets and learn correspondences. Finally, the output sequence of the decoder is used to predict the normalized pixel values of the masked patches [1], with an L2 loss applied between the decoder's prediction and the ground truth.
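The following is a minimal sketch of one such decoder block and of the MAE-style reconstruction loss [1]. The decoder width, head count, and the per-patch normalization details are assumptions; the pre-norm residual layout is one common choice, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Cross-attention to previously visible tokens, then self-attention among the
    frame's full token set (sketch; dimensions are assumptions)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, full_tokens, context):
        # Cross-attention: full tokens (queries) attend to all previously visible tokens.
        q = self.norm1(full_tokens)
        full_tokens = full_tokens + self.cross_attn(q, context, context)[0]
        # Self-attention among the full tokens of the frame being reconstructed.
        s = self.norm2(full_tokens)
        full_tokens = full_tokens + self.self_attn(s, s, s)[0]
        return full_tokens + self.mlp(self.norm3(full_tokens))


def reconstruction_loss(pred, target_patches, masked_ids):
    """L2 loss between predictions and per-patch-normalized pixel targets,
    computed only on the masked patches (MAE-style targets [1], sketch)."""
    mean = target_patches.mean(dim=-1, keepdim=True)
    var = target_patches.var(dim=-1, keepdim=True)
    target = (target_patches - mean) / (var + 1e-6).sqrt()
    return F.mse_loss(pred[:, masked_ids], target[:, masked_ids])
```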