改进重构权值的局部线性嵌入算法
摘 要
目的 局部线性嵌入(LLE)算法是机器学习、数据挖掘等领域中的一种经典的流形学习算法。为克服LLE算法难以有效处理噪声、大曲率和稀疏采样数据等问题,提出一种改进重构权值的局部线性嵌入算法(IRWLLE)。方法 采用测地线距离来描述结构,重新构造和定义LLE中的重构权值,即在某样本的邻域内,将测地距离与欧氏距离之比定义为结构权值;将测地距离与中值测地距离之比定义为距离权值,再将结构权值与距离权值的乘积作为重构权值,从而将流形的结构和距离两种信息进行有机的结合。结果 对经典的人工数据Swiss roll、S-curve和Helix进行实验,在数据中加入噪声干扰,同时采用稀疏采样的方式来生成数据集,并与原始LLE算法和Hessian局部线性嵌入(HLLE)算法进行比较。实验结果表明,IRWLLE算法对比于LLE算法和HLLE算法,能够更好地保持流形的近邻关系,对流形的展开更加完好。尤其是对于加入噪声的大曲率数据集Helix,IRWLLE展现出极强的鲁棒性。对ORL和Yale人脸数据库进行人脸识别实验,采用最近邻分类器进行识别,将IRWLLE算法的识别结果与LLE算法进行对比。对于ORL数据集,IRWLLE算法识别率为90%,原LLE算法的识别率为85.5%;对于Yale数据集,IRWLLE算法识别率为88%,原LLE算法的识别率为75%,可见IRWLLE在人脸识别率上也有很大提高。结论 本文提出的IRWLLE算法对比于原LLE算法,不仅将流形距离信息引入到重构权值中,而且还将结构信息加入其中,有效减少了噪声和流形外数据点的干扰,所以对于噪声数据具有更强的鲁棒性,能够更好地处理稀疏采样数据和大曲率数据,在人脸识别率上也有较大提升。
关键词
Improved reconstruction weight-based locally linear embedding algorithm
Liu Fangyuan, Xia Kewen, Niu Wenjia(School of Electronics & Information Engineering, Hebei University of Technology, Tianjin 300130, China) Abstract
Objective The development of science and technology has reduced the cost of data collection, increased data in geometric series, and caused data dimension reduction technology to be an important part of machine learning. Manifold learning method is a nonlinear dimensionality reduction technique that is widely used in the visualization, feature extraction, and solving of the computational complexity of high-dimensional data. Locally linear embedding (LLE) algorithm is a classical manifold learning algorithm in machine learning and data mining. The basic idea of the LLE algorithm is that any sampling point and its neighbors form a locally linear plane. Each nearest neighbor corresponds to a weight, and the sampling point can be linearly represented by the principle of minimizing the reconstruction error of the nearest neighbor. An improved reconstruction weight-based LLE (IRWLLE) algorithm is proposed to overcome the problem of the LLE algorithm, including noise, large curvature, and sparse sampling data. Method Geodesic distance is used to describe the structure and reconstruct weights and define the reconstruction weights in the LLE to overcome the shortcomings of the original LLE algorithm, which considers only distance factors and ignores structural factors. Structural and distance weights are added. Any sample is selected as the center point, and the nearest-neighbor point (sample) from the center point is selected as the local neighborhood according to Euclidean distance. In this neighborhood, the ratio of the geodesic distance to the Euclidean distance between the center and neighboring points is defined as the structural weight, and the ratio of the geodesic distance to the median geodesic distance between the center and neighboring points is defined as the distance weight. The product of structural and distance weights is defined as the reconstruction weight; thus, the structure and distance information of the manifold are organically combined. Geometric distance calculation method using the classic Dijkstra algorithm is commonly used in Isomap algorithm. For the distance weight, the median distance of the geodesic distance in a local neighborhood is fixed. A farther distance from the center sample point of a neighboring point indicates a smaller distance weight corresponding to this neighbor point, which is in line with the idea that "a greater distance from the neighborhood center means a smaller contribution to the reconstruction center." The geodesic distance divided by its value reduces the noise effect on the weight to a certain extent. In the structural weight part, the ratio of Euclidean distance to geodesic distance is selected to measure the linearity of the local neighborhood. A greater distance from the linear plane of a neighboring point indicates a smaller contribution to the reconstruction center point and a smaller structural weight of the adjacent point. This notion further emphasizes the importance of structure to weight and enhances noise immunity. Result Classical artificial data, such as Swiss roll, S-curve, and Helix, are experimented, and noise is added to the data. Sparse sampling is used to generate a data set. The proposed algorithm is compared with the original LLE algorithm and Hessian LLE (HLLE) algorithm. Results show that the IRWLLE algorithm is better than the LLE and HLLE algorithms by maintaining the neighbor relation of the manifold and improving the manifold. In particular, IRWLLE exhibits stronger robustness for the large curvature data set Helix. A face recognition experiment on ORL and Yale face databases is conducted by using a nearest-neighbor classifier, and the recognition result of the IRWLLE algorithm is compared with that of the LLE algorithm. For the ORL dataset, the recognition rate of the IRWLLE algorithm is 90%, whereas that of the original LLE algorithm is 85.5%. For the Yale dataset, the recognition rate of the IRWLLE algorithm is 88%, whereas that of the original LLE algorithm is 75%. Therefore, the face recognition rate of IRWLLE is also greatly improved. Conclusion The proposed IRWLLE algorithm is based on the original LLE algorithm. This proposed algorithm not only introduces manifold distance information into reconstruction weight but also adds structural information, thereby effectively reducing the interference from noise and data outside the manifold. The IRWLLE algorithm has high robustness to noise data and can manage sparse sampling and large curvature data. The face recognition rate of the IRWLLE algorithm is also enhanced.
Keywords
Manifold learning locally linear embedding reconstruction weight dimensionality reduction robustness
|