A Cluster Validity Function Based on Geometric Probability

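Li Xiaowen1, Mao Zhengyuan1, Li Jianwei1 (1 Spatial Information Research Center of Fujian Province, Key Laboratory of Spatial Data Mining and Information Sharing of Ministry of Education, Fuzhou University, Fuzhou 350002, China)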

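Abstract
Cluster validity is a fundamental unsolved problem in cluster analysis, and determining the optimal number of clusters is its main research topic. Taking geometric probability as the theoretical basis, this paper proposes a new cluster validity function for two-dimensional datasets that is used to determine the optimal number of clusters. The function exploits the correspondence between a two-dimensional dataset and a two-dimensional discrete point set, and measures the cluster structure of the dataset from the distribution of that point set in feature space; the idea is intuitive and easy to understand. During measurement, every pair of points in the point set is connected to generate a set of line segments that preserves the structural information of the point set, and the cluster validity function is constructed by comparing the direction values of these segments with the direction values expected under complete randomness. Experimental results show that, for a given sample dataset, generating the curve of the function and examining its shape effectively determines the optimal number of clusters of a two-dimensional dataset and guides the design of clustering algorithms.
Keywords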
A Cluster Validity Function Based on Geometric Probability


Abstract
Determining the optimum cluster number is a key research topic in cluster validity, a fundamental unsolved problem in cluster analysis. To determine the optimum cluster number, this article proposes a new cluster validity function for two-dimensional datasets, with geometric probability as its theoretical basis. The function uses the relationship between a two-dimensional dataset and the corresponding two-dimensional discrete point set to measure the cluster structure of the dataset according to the distribution of the point set in feature space. The design is intuitive and therefore easy to understand. During measurement, the structural information of the point set is stored in a set of line segments generated by connecting each pair of points in the point set. The cluster validity function is formed by comparing the direction values of the segments in this set with those resulting from a completely random condition. The case study shows that the shape of the function curve generated for a given example dataset makes it possible to determine the optimum cluster number of the dataset effectively and supports the design of clustering algorithms.
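The measurement step described above (connecting each pair of points and examining the segment directions) can be illustrated with a minimal Python sketch. It is not the validity function proposed in the article, whose exact form is not given in this abstract. The names `segment_directions` and `direction_deviation`, the histogram with `bins` cells, and the use of a uniform distribution of directions as a stand-in for the completely random condition are all assumptions made for illustration.

```python
import math
from itertools import combinations


def segment_directions(points):
    """Return the direction, in [0, pi), of the segment joining every pair of points."""
    angles = []
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if (x1, y1) == (x2, y2):
            continue  # coincident points define no direction
        # A segment has no orientation, so fold the angle into [0, pi).
        angles.append(math.atan2(y2 - y1, x2 - x1) % math.pi)
    return angles


def direction_deviation(points, bins=4):
    """Hypothetical statistic: total absolute gap between the observed
    direction histogram and a uniform one (assumed random baseline)."""
    angles = segment_directions(points)
    if not angles:
        return 0.0
    counts = [0] * bins
    for angle in angles:
        # Clamp guards against an angle that rounds up to pi.
        index = min(int(angle / math.pi * bins), bins - 1)
        counts[index] += 1
    total = len(angles)
    return sum(abs(count / total - 1.0 / bins) for count in counts)
```

Calling `direction_deviation` on a set of collinear points gives a large value because every segment shares one direction, whereas a point set without directional structure should give a value closer to zero under the uniform-baseline assumption.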
Keywords
