Text-Video-to-Motion
multimodal input motion generation
1 Introduction
Current research on motion generation primarily focuses on text-based prompts. However, this approach often faces challenges due to limited data availability. Additionally, it can be difficult for users to precisely describe desired motions in text. Leveraging advancements in multimodal models, we can now extract semantic meaning from videos and images more effectively. Our goal is to develop an end-to-end model that enables users to generate motions that align with their limitless imagination.
2 Data Preparation
Our advanced A3D product can generate motion data from videos, allowing us to create an effectively unlimited amount of video-motion paired data. Leveraging powerful multimodal large language models (MLLMs), such as VideoChat2, Gemini 1.5, and mPLUG-2, as well as models with text decoders like Valor, we can produce synthetic text-video-motion pairs. We use text-video similarity metrics to select relatively "accurate" pairs (a filtering sketch is given below). Additionally, our A3D pipeline incorporates multiple mechanisms for assessing the quality of the generated motion data, enabling us to choose high-quality video-motion pairs. For more details, please refer to the #data_synthesis project.
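A minimal sketch of the similarity-based filtering step, assuming the text and video embeddings have already been extracted by an aligned encoder (e.g., Valor or InternVideo); the threshold and field names are illustrative rather than the production values:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def filter_pairs(samples, sim_threshold: float = 0.3):
    """Keep synthetic text-video-motion triplets whose caption matches the video."""
    kept = []
    for s in samples:
        sim = cosine_similarity(s["text_emb"], s["video_emb"])
        if sim >= sim_threshold:  # discard captions that drift from the video content
            kept.append({**s, "text_video_sim": sim})
    return kept
```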
3 Encoder Selection
Among the various video-text aligned multimodal models, we selected Valor and InternVideo as the encoders for our experiments.
The reasons for choosing InternVideo are as follows:
  1. InternVideo aligns on global tokens, which facilitates easier modal alignment during the motion transformer training phase.
  2. InternVideo is also trained on images, which can be interpreted as single-frame videos, better aligning with our initial objectives.
The reasons for selecting Valor are:
  1. In testing on our dataset, Valor achieved the best retrieval scores (see the retrieval sketch below).
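A minimal sketch of the retrieval check behind these scores, assuming L2-normalized global embeddings where text i is the ground-truth match for video i; the actual evaluation harness may differ:

```python
import numpy as np

def recall_at_k(text_embs: np.ndarray, video_embs: np.ndarray, ks=(1, 5, 10)):
    """Text-to-video retrieval: R@K over an (N, D) pair of embedding matrices."""
    sims = text_embs @ video_embs.T            # (N, N) similarity matrix
    ranks = (-sims).argsort(axis=1)            # best-matching video first
    gt = np.arange(len(text_embs))[:, None]    # ground-truth index per text
    hits = ranks == gt                         # True where the match appears
    return {f"R@{k}": float(hits[:, :k].any(axis=1).mean()) for k in ks}
```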
4 Training
Once we have the synthetic text-video-motion data, we train our model using the following methods:
  1. Masking out text embeddings.
  2. Masking out visual embeddings.
  3. Using both text and visual embeddings.
We train the model by randomly masking out prompt embeddings within each batch; the ratios of text mask-out, video mask-out, and no masking are 0.4, 0.4, and 0.2, respectively (see the sketch below). Strategies 1 and 2 align with our initial goal, where users provide only one type of prompt to generate motion. Strategy 3 aims to enhance generation quality, following Valor's theory. We chose not to use a modality adapter, such as the Q-Former in BLIP-2, because our data is synthetic: we suspect that training an adapter on synthetic data might not effectively improve performance in real-world scenarios.
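A minimal sketch of the per-sample prompt masking with the 0.4/0.4/0.2 ratios, assuming projected text and video embeddings of shape (B, L, D); the motion-transformer call itself is omitted:

```python
import torch

def mask_prompts(text_emb: torch.Tensor, video_emb: torch.Tensor):
    """text_emb: (B, Lt, D), video_emb: (B, Lv, D). Returns masked copies."""
    B = text_emb.shape[0]
    choice = torch.multinomial(
        torch.tensor([0.4, 0.4, 0.2]), num_samples=B, replacement=True
    )                                            # 0: mask text, 1: mask video, 2: keep both
    text_keep = (choice != 0).float().view(B, 1, 1)
    video_keep = (choice != 1).float().view(B, 1, 1)
    return text_emb * text_keep, video_emb * video_keep
```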
It is worth mentioning that Valor does not perform one-token alignment but rather aligns entire token sequences (a multi-to-multi calculation), maximizing similarity over all token pairs. Consequently, the number of prompt tokens passed downstream needs to be adjusted accordingly.
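For intuition, one common way to realize such a multi-to-multi alignment is to aggregate a token-wise similarity matrix across the two sequences; whether Valor aggregates in exactly this way is not asserted here:

```python
import torch
import torch.nn.functional as F

def sequence_similarity(text_tokens: torch.Tensor, video_tokens: torch.Tensor) -> torch.Tensor:
    """text_tokens: (Lt, D), video_tokens: (Lv, D) -> scalar sequence-level similarity."""
    t = F.normalize(text_tokens, dim=-1)
    v = F.normalize(video_tokens, dim=-1)
    sim_matrix = t @ v.T        # (Lt, Lv) pairwise token similarities
    return sim_matrix.mean()    # aggregate over all token pairs
```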
5 Ablation Study
5.1 Connection Methods
Figure 1
We tested three different methods for connecting prompt embeddings to the motion transformer:
  1. Prepending them as the first few tokens of the input sequence, as shown in Fig. 1.
  2. Injecting them via cross-attention layers, as illustrated in Fig. 2.
  3. Treating all prompt types uniformly, as shown in Figs. 1 and 3. In the case of InternVideo, inspired by the Segment Anything Model (SAM), we do not differentiate between prompt types but feed them through the same path.
For Method 3, we adjusted the training strategy accordingly, which means that Strategy 3 from the Training section was not used. A sketch of Methods 1 and 2 is given after Fig. 3.
Figure 2
Figure 3
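A minimal sketch contrasting Methods 1 and 2, assuming the prompt embeddings have already been projected to the motion transformer's hidden size; layer sizes and module names are illustrative, not the production architecture:

```python
import torch
import torch.nn as nn

class MotionBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8, use_cross_attn: bool = False):
        super().__init__()
        self.use_cross_attn = use_cross_attn
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, motion_tokens: torch.Tensor, prompt_tokens: torch.Tensor):
        if not self.use_cross_attn:
            # Method 1: prepend prompt tokens to the motion sequence (Fig. 1).
            x = torch.cat([prompt_tokens, motion_tokens], dim=1)
            n = self.norm1(x)
            x = x + self.self_attn(n, n, n)[0]
            return x[:, prompt_tokens.shape[1]:]      # keep only the motion positions
        # Method 2: inject prompts through a cross-attention layer (Fig. 2).
        n = self.norm1(motion_tokens)
        x = motion_tokens + self.self_attn(n, n, n)[0]
        return x + self.cross_attn(self.norm2(x), prompt_tokens, prompt_tokens)[0]
```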
5.2 Valor vs InternVideo
We train the motion transformer using different modality encoders and select the best-performing one.
6 Evaluation Metrics
We evaluate our model on a private test set to assess performance in our use cases. Table 1 compares the previously built text-to-motion model, trained on synthetic text-motion pairs only, against the text-video-to-motion model trained on synthetic data with both text and video modalities. The results demonstrate that incorporating visual prompts did not degrade performance with text prompts.
Table 1
Table 2 indicates that performance with visual prompts is promising when compared to text-to-motion performance.
Table 2
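The exact metric suite used on our private test set is not reproduced here; as one illustration, an FID-style comparison of motion feature distributions, of the kind used in text-to-motion work such as T2M-GPT and MoMask, can be computed as follows (a pretrained motion feature extractor is assumed to produce the pooled features):

```python
import numpy as np
from scipy import linalg

def motion_fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """Frechet distance between (N, D) feature matrices of real and generated motions."""
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # drop tiny imaginary parts from sqrtm
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```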
7 Visualization Results
As observed, the model effectively supports different modalities and generates results based on the semantic meaning of various prompt types.
8 Conclusion
Our motion transformer handles multimodal inputs effectively without the need to expand the backbone architecture. However, incorporating both modalities does not improve performance on text-only prompts beyond the original text-to-motion model.
9 Future Work
To support fused-modality input, the model can be adapted to handle inputs such as "a man is waving his hands like <img>."
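As a hypothetical sketch, such a fused prompt could be built by splicing an image embedding into the text token embeddings at the position of the <img> placeholder; the tokenization convention and encoders here are assumptions for illustration only:

```python
import torch

def fuse_prompt(text_tok_embs: list, img_emb: torch.Tensor, img_slot: int) -> torch.Tensor:
    """text_tok_embs: list of (D,) token embeddings; img_emb: (D,) image embedding.

    Inserts the image embedding where the <img> token would appear and returns the
    fused prompt sequence of shape (L_text + 1, D).
    """
    fused = text_tok_embs[:img_slot] + [img_emb] + text_tok_embs[img_slot:]
    return torch.stack(fused, dim=0)
```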
10 References
Chen, S., He, X., Guo, L., Zhu, X., Wang, W., Tang, J., & Liu, J. (2023). VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset. arXiv. https://arxiv.org/abs/2304.08345
Wang, Y., Li, K., Li, Y., He, Y., Huang, B., Zhao, Z., Zhang, H., Xu, J., Liu, Y., Wang, Z., Xing, S., Chen, G., Pan, J., Yu, J., Wang, Y., Wang, L., & Qiao, Y. (2022). InternVideo: General Video Foundation Models via Generative and Discriminative Learning. arXiv. https://arxiv.org/abs/2212.03191
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. arXiv. https://arxiv.org/abs/2304.02643
Li, K., He, Y., Wang, Y., Li, Y., Wang, W., Luo, P., Wang, Y., Wang, L., & Qiao, Y. (2023). VideoChat: Chat-Centric Video Understanding. arXiv. https://arxiv.org/abs/2305.06355
Xu, H., Ye, Q., Yan, M., Shi, Y., Ye, J., Xu, Y., Li, C., Bi, B., Qian, Q., Wang, W., Xu, G., Zhang, J., Huang, S., Huang, F., & Zhou, J. (2023). mPLUG-2: A Modularized Multi-Modal Foundation Model Across Text, Image and Video. arXiv. https://arxiv.org/abs/2302.00402
Gemini Team, Georgiev, P., Lei, V. I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., Mariooryad, S., Ding, Y., Geng, X., Alcober, F., Frostig, R., Omernick, M., Walker, L., Paduraru, C., Sorokin, C., … Vinyals, O. (2024). Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. arXiv. https://arxiv.org/abs/2403.05530
Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. arXiv. https://arxiv.org/abs/2301.12597
Guo, C., Mu, Y., Javed, M. G., Wang, S., & Cheng, L. (2023). MoMask: Generative Masked Modeling of 3D Human Motions. arXiv. https://arxiv.org/abs/2312.00063
Zhang, J., Zhang, Y., Cun, X., Huang, S., Zhang, Y., Zhao, H., Lu, H., & Shen, X. (2023). T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations. arXiv. https://arxiv.org/abs/2301.06052