Visual-semantic dual-disentangling for generalized zero-shot learning

Han Ayou1,2, Yang Guan1,2, Liu Xiaoming1,2, Liu Yang3 (1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China; 2. Henan Key Laboratory of Network Public Opinion Monitoring and Intelligent Analysis, Zhengzhou 450007, China; 3. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China)

Abstract
Objective Traditional zero-shot learning (ZSL) aims to predict and classify data from unseen classes based on data from seen classes and related auxiliary information, whereas in generalized zero-shot learning (GZSL) the classes to be classified may be either seen or unseen, which better matches real-world application scenarios. In generation-based GZSL, the original and generated features do not necessarily encode the semantically relevant information referred to by the shared attributes, which biases the model toward the seen classes; in addition, classification ignores the useful feature-related information contained in the semantics. To separate out the relevant visual features and semantic information, a visual-semantic dual-disentangling framework is proposed.

Method First, a conditional variational auto-encoder is used to generate visual features for the unseen classes, and a feature disentanglement module then decomposes them into semantic-consistent and semantic-irrelevant features. Next, a semantic disentanglement module is designed to decompose the semantic information into feature-relevant and feature-irrelevant semantics. A total correlation penalty guarantees the independence of the two disentangled components; the feature disentanglement module measures the semantic consistency of its disentangled components with a relation network, and the semantic disentanglement module guarantees their feature relevance through cross-modal cross-reconstruction. Finally, the semantic-consistent features and the feature-relevant semantic information separated by the two disentanglement modules are used to jointly learn a GZSL classifier.

Result Experiments on four public GZSL datasets, AwA2 (Animals with Attributes 2), CUB (Caltech-UCSD Birds-200-2011), SUN (SUN Attribute), and FLO (Oxford Flowers), achieve better results than the baseline, with the harmonic mean improving by 1.6%, 3.2%, 6.2%, and 1.5% on AwA2, CUB, SUN, and FLO, respectively.

Conclusion For GZSL classification, the proposed visual-semantic dual-disentangling method is experimentally shown to achieve better performance than the baseline and to outperform most existing related methods.
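To make the generation step concrete, the following is a minimal PyTorch sketch of a conditional VAE that synthesizes visual features for unseen classes from their attribute vectors. It is an illustration under our own assumptions, not the paper's implementation: the class name FeatureCVAE, all layer sizes, and the sampling snippet are hypothetical.

```python
# Minimal sketch (assumed PyTorch implementation; dimensions are illustrative,
# not taken from the paper). A conditional VAE: encoder q(z | x, a) and decoder
# p(x | z, a), where x is a visual feature and a is the class attribute vector.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureCVAE(nn.Module):
    def __init__(self, feat_dim=2048, attr_dim=85, z_dim=64, hidden=1024):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(feat_dim + attr_dim, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, z_dim)
        self.logvar = nn.Linear(hidden, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + attr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim))

    def forward(self, x, a):
        h = self.enc(torch.cat([x, a], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        x_rec = self.dec(torch.cat([z, a], dim=1))
        return x_rec, mu, logvar

def cvae_loss(x, x_rec, mu, logvar):
    # Reconstruction term plus KL divergence to the standard normal prior.
    rec = F.mse_loss(x_rec, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

# After training on seen classes, features for an unseen class can be sampled
# from its attribute vector a_u:
#   z = torch.randn(n, 64)
#   x_fake = model.dec(torch.cat([z, a_u.expand(n, -1)], dim=1))
```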
Keywords
Visual-semantic dual-disentangling for generalized zero-shot learning

Han Ayou1,2, Yang Guan1,2, Liu Xiaoming1,2, Liu Yang3 (1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China; 2. Henan Key Laboratory of Network Public Opinion Monitoring and Intelligent Analysis, Zhengzhou 450007, China; 3. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China)

Abstract
Objective Traditional deep learning models are widely adopted and perform effectively in many application scenarios, but they rely on large numbers of training samples, which are often difficult to collect in practice. Moreover, such models can identify only the classes already present in the training phase (seen classes); handling classes never seen during training (unseen classes) remains a challenge. Zero-shot learning (ZSL) offers a good solution to this challenge: it aims to classify unseen classes for which no training samples are available during the training phase. The real world, however, is more complex, and in practice test samples may come from both seen and unseen classes. Generalized zero-shot learning (GZSL) was therefore proposed as a more realistic and universal setting, in which the test set is sampled from both seen and unseen classes. Existing GZSL methods can be divided into two categories, embedding-based and generation-based. The former learns a projection or embedding function that associates the visual features of the seen classes with the corresponding semantics, whereas the latter learns a generative model that synthesizes visual features for the unseen classes. In previous studies, the visual features extracted by pretrained deep models (e.g., ResNet-101) are not extracted specifically for the GZSL task, and not all of their dimensions are semantically related to the predefined attributes; this biases the model toward the seen classes. In addition, most methods ignore the useful feature-related information contained in the semantics during classification, which markedly affects the final result. In this paper, we propose a new GZSL method, a visual-semantic dual-disentangling framework for generalized zero-shot learning (VSD-GZSL), to disentangle the relevant visual features and semantic information.

Method A conditional variational auto-encoder (VAE) is combined with a disentanglement network and trained in an end-to-end manner. The proposed disentanglement network has an encoder-decoder structure. The visual features and semantics of the seen classes are first used to train the conditional VAE and the disentanglement network. Once the network has converged, the trained generative network synthesizes visual features for the unseen classes. The real features of the seen classes and the generated features of the unseen classes are fed into a visual-feature disentanglement network, which separates them into semantic-consistent and semantic-irrelevant features; the semantics are fed into a semantic disentanglement network, which separates them into feature-relevant and feature-irrelevant semantic information. The components produced by the two disentanglement networks are passed to the decoders and reconstructed back into their corresponding spaces, with a reconstruction loss preventing information loss during disentanglement. A total correlation penalty measures the independence between the latent variables produced by each disentanglement network. A relation network maximizes the compatibility score between the components disentangled by the visual disentanglement network and the corresponding semantics, thereby learning the semantic consistency of the visual features.
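As an illustration of the two disentanglement modules and the relation network, here is a hedged PyTorch sketch; the module names (Disentangler, RelationNet), dimensions, and the usage comments are our own assumptions, not code from the paper.

```python
# Illustrative sketch only (architecture details assumed, not from the paper).
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    """Encoder-decoder that splits an input into two latent codes and
    reconstructs the input from their concatenation (reconstruction loss
    prevents information loss during disentanglement)."""
    def __init__(self, in_dim, code_dim=256, hidden=1024):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, 2 * code_dim))
        self.decoder = nn.Sequential(nn.Linear(2 * code_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, in_dim))

    def forward(self, x):
        a, b = self.encoder(x).chunk(2, dim=1)        # two disentangled codes
        rec = self.decoder(torch.cat([a, b], dim=1))  # reconstruction target
        return a, b, rec

class RelationNet(nn.Module):
    """Compatibility score in (0, 1) between a semantic-consistent visual
    code and a class attribute vector."""
    def __init__(self, code_dim=256, attr_dim=85, hidden=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_dim + attr_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, h_s, attr):
        return self.net(torch.cat([h_s, attr], dim=1)).squeeze(1)

# Usage idea: a visual Disentangler yields (h_s, h_n) from features, and a
# semantic Disentangler yields (s_r, s_n) from attributes. The relation network
# is trained so matched (h_s, attr) pairs score near 1 and mismatched pairs near
# 0, pushing attribute information into h_s only. A cross-modal reconstruction,
#   x_cm = visual_dis.decoder(torch.cat([s_r, h_n], dim=1))
# (valid when both code dims are equal), ties s_r to the visual space.
```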
The feature-relevant semantic information produced by the semantic disentanglement network is then fed into the decoder of the visual disentanglement network for cross-modal reconstruction, which measures the feature relevance of the semantics. Finally, the semantic-consistent features and the feature-relevant semantics disentangled by the two networks are used to jointly learn a generalized zero-shot classifier.

Result The proposed method was validated on four public GZSL datasets (AwA2, CUB, SUN, and FLO). It achieved better results than the baseline: on AwA2, the unseen-class accuracy improved by 3.8%, the seen-class accuracy by 0.2%, and the harmonic mean by 1.6%; on CUB, the unseen-class accuracy improved by 3.8%, the seen-class accuracy by 2.4%, and the harmonic mean by 3.2%; on SUN, the unseen-class accuracy improved by 10.1%, the seen-class accuracy by 4.1%, and the harmonic mean by 6.2%; and on FLO, the seen-class accuracy improved by 9.1% and the harmonic mean by 1.5%. The proposed method was also compared with 10 recently proposed GZSL methods. Compared with f-CLSWGAN, VSD-GZSL improves the harmonic mean by 10%, 8.4%, 8.1%, and 5.7% on the four datasets; compared with cycle-consistent adversarial networks for zero-shot learning (CANZSL), by 12.2%, 5.6%, 7.5%, and 4.8%; compared with leveraging invariant side GAN (LisGAN), by 8.1%, 6.5%, 7.3%, and 3%; compared with cross- and distribution-aligned VAE (CADA-VAE), by 6.5%, 5.7%, 6.9%, and 10%; and compared with f-VAEGAN-D2, by 6.9%, 4.5%, 6.2%, and 6.7%. Compared with Cycle-CLSWGAN, VSD-GZSL improves the harmonic mean by 5.1%, 8.1%, and 6.2% on CUB, SUN, and FLO, respectively, and compared with feature refinement (FREE), by 3.3%, 0.4%, and 5.8% on AwA2, CUB, and SUN, respectively. These results demonstrate the effectiveness of the proposed method.

Conclusion The proposed VSD-GZSL method is superior to traditional models. It disentangles the semantically consistent components of the visual features and the feature-relevant information in the semantics, and then learns the final classifier from these two mutually consistent representations. Compared with several related methods, VSD-GZSL achieves a remarkable performance improvement on multiple datasets.
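For completeness, the harmonic mean quoted throughout the results above is the standard GZSL metric: per-class average accuracy on the seen (S) and unseen (U) test classes, combined as H = 2SU / (S + U). A small self-contained sketch of its computation (the function names are ours):

```python
# Standard GZSL evaluation sketch: per-class average accuracy on seen (S) and
# unseen (U) test samples, combined by the harmonic mean H = 2*S*U / (S + U).
import numpy as np

def per_class_accuracy(y_true, y_pred):
    """Mean of per-class accuracies, so rare classes weigh as much as common ones."""
    classes = np.unique(y_true)
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(accs))

def harmonic_mean(s_acc, u_acc):
    return 2 * s_acc * u_acc / (s_acc + u_acc) if (s_acc + u_acc) > 0 else 0.0

# Example: S = 0.70 and U = 0.50 give H = 2 * 0.7 * 0.5 / 1.2 ≈ 0.583.
```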
Keywords
