Comprehensive review of visual-language-oriented multimodal pre-training methods

Zhang Haoyu1, Wang Tianbao1, Li Mengze1, Zhao Zhou1, Pu Shiliang2, Wu Fei1 (1. College of Computer Science and Technology, Zhejiang University, Hangzhou 310013, China; 2. Hangzhou Hikvision Digital Technology Co., Ltd., Hangzhou 310051, China)

Abstract

Multimodal machine learning is constrained by the high cost of manual annotation and by the difficulty of transferring models across datasets. These constraints force extensive retraining and lead to low efficiency and unbalanced resource allocation across training tasks. Pre-training addresses this by training a model on large-scale data through self-supervision, extracting and integrating information from multiple modalities and from the context within the dataset, so that the model learns internal knowledge representations that serve related downstream vision-language tasks. Because human labels are expensive, research on pre-trained models has focused on cheaply labeled data. The model is first pre-trained on cheaply labeled data and then fine-tuned with a small amount of expensive human annotation. Because cheaply labeled data carry less information and more noise, pre-training usually requires large-scale data and long training times. A model pre-trained on large-scale unlabeled data not only transfers more general knowledge to the target task but also provides a better initial point for the parameters.

Multimodal settings have potential applications such as learning from demonstration, sentiment analysis, and task-oriented large-scale human-computer interaction. Multimodal pre-training models can be regarded as a pathway for weak artificial intelligence to progress from local to global. They make it possible to transfer what is learned across multiple tasks to unsupervised, multi-domain data automatically and quickly. Text-only pre-training models cover only a small portion of online data, and richer data have not been fully used or learned from. Multimodal context is beneficial for information gathering, context perception, knowledge learning, and demonstration. In pursuit of general-purpose artificial intelligence models, pre-training has been moving from single-modal to multimodal. Since 2019, the rapid growth of pre-training models has extended to the interaction of vision and text. Thanks to the large-scale image-text pairs and video data available online and to advances in pre-training techniques such as self-supervised learning, vision-language multimodal pre-training models have bridged the gap between different vision-language tasks, which streamlines multi-task training and improves performance on specific tasks.

Current multimodal research still faces the challenges of building intelligent systems, perceiving multimodal information, and bridging the semantic gap. We review existing pre-training datasets and pre-training methods and present a systematic overview of both recent and traditional methods. The commonalities and differences among the methods are critically analyzed, and the experimental settings of each model on specific downstream tasks are summarized. Finally, the challenges and future research directions of vision-language pre-training are outlined.
