Measuring the dissimilarity between semantic vectors is an important foundation for applying deep learning to problems in natural language processing. Dissimilarity measures on high-dimensional semantic vectors suffer from the “concentration of measure” problem, so the results produced by traditional measures fail to reflect the differences between semantic vectors. To address this problem, a dissimilarity measure based on the Jaccard coefficient for asymmetric multi-valued features is proposed. The statistical distribution of the dimension values of high-dimensional semantic vectors shows that, in some dimensions, the values are densely concentrated within a specific range and therefore cannot contribute to dissimilarity. Different dimensions thus contribute differently to dissimilarity, that is, they are asymmetric. The method defines an importance function over dimension values, selects the dimensions whose importance value satisfies a threshold to participate in the dissimilarity computation, and discards the dimensions that cannot contribute. This reduces dimensionality and alleviates the “concentration of measure” problem. Experiments were conducted on a fishery dataset and on public datasets, comparing different measures on semantic vectors of different dimensionalities. With no obvious degradation in semantic quality, the diversity index of the proposed method improves substantially over that of the current best measure.
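The selection step described above can be illustrated with a minimal sketch. The abstract does not give the importance function, the threshold, or the exact form of the multi-valued Jaccard coefficient, so everything concrete here is an assumption made for illustration: importance is taken to be the distance of a value from its dimension's mean in standard deviations, real values are binned into discrete levels to act as multi-valued features, and a dimension is skipped when both vectors fall inside the dense region (by analogy with ignoring 0-0 matches in the asymmetric binary Jaccard coefficient). The function names are hypothetical and this is not the authors' implementation.

```python
from statistics import mean, pstdev


def dimension_stats(vectors):
    """Per-dimension mean and population standard deviation over a corpus."""
    columns = list(zip(*vectors))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def importance(value, mu, sigma):
    """ASSUMED importance function: distance from the dimension's dense
    region, in standard deviations. Constant dimensions get importance 0."""
    return abs(value - mu) / sigma if sigma > 0 else 0.0


def bucket(value, mu, sigma):
    """ASSUMED discretization: map a real value to an integer level (one
    level per standard deviation) so it acts as a multi-valued feature."""
    return round((value - mu) / sigma) if sigma > 0 else 0


def asymmetric_jaccard_distance(u, v, mus, sigmas, threshold=1.0):
    """Dissimilarity over the dimensions where at least one of the two
    values meets the importance threshold; other dimensions are dropped."""
    kept = matched = 0
    for a, b, mu, sigma in zip(u, v, mus, sigmas):
        if max(importance(a, mu, sigma), importance(b, mu, sigma)) < threshold:
            continue  # both values lie in the dense region: no contribution
        kept += 1
        if bucket(a, mu, sigma) == bucket(b, mu, sigma):
            matched += 1
    # With no informative dimension left, treat the vectors as identical.
    return 0.0 if kept == 0 else 1.0 - matched / kept


# Toy corpus: dimension 0 is constant, so it never contributes.
corpus = [[0.0, 0.0, 0.0], [0.0, 0.0, 4.0], [0.0, 4.0, 0.0], [0.0, 4.0, 4.0]]
mus, sigmas = dimension_stats(corpus)
d = asymmetric_jaccard_distance(corpus[0], corpus[1], mus, sigmas, threshold=0.5)
```

In this toy corpus the constant first dimension is excluded from every comparison, so the distance is driven only by the two dimensions that actually vary. That is the dimension-reduction effect the abstract attributes to the threshold.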