Abstract: Vision-language pre-training models learn visual concepts from image-text or video-text pairs, which can be adopted for visual-textual tasks. In this paper, we adopt these concepts as prior ...