
scikit-learn Notes - Unsupervised Learning

Source: https://airgiser.github.io/2018/10/25/scikit-learn-unsupervised-learning/

Clustering

Clustering is the task of partitioning a dataset into clusters. Usually, samples in the same cluster are more similar to each other than to samples in other clusters.

K-Means

Steps of the K-Means clustering algorithm

  1. Initialize the number of clusters and each cluster's centroid
  2. Assignment: assign each sample to the cluster of its nearest centroid
  3. Update: recompute each centroid from the samples currently assigned to its cluster
  4. Repeat steps 2 and 3 until the assignments stop changing.
import numpy as np
from sklearn.cluster import KMeans

# toy training data: two well-separated groups of points
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

# Compute k-means clustering
kmeans = KMeans(n_clusters=2)
kmeans.fit(X)

# Predict the closest cluster each sample in X belongs to.
print(kmeans.predict(X))
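After fitting, the learned centroids and the per-sample cluster assignments are also available as attributes (a quick check, reusing the kmeans object above):

print(kmeans.cluster_centers_)  # coordinates of the two learned centroids
print(kmeans.labels_)           # cluster index assigned to each training sample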

Limitations of the K-Means algorithm

The clustering result depends heavily on the initialization, and the algorithm may well converge to a merely locally optimal solution.

For this reason we usually run K-Means multiple times (the n_init parameter in sklearn), re-randomizing the initialization each time, and keep the best (lowest-cost) result.
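A minimal sketch of this, reusing X from the snippet above; n_init controls the number of random restarts, and inertia_ reports the cost (within-cluster sum of squared distances) of the best run:

from sklearn.cluster import KMeans

# Run k-means 10 times from different random initializations;
# fit() keeps the run with the lowest inertia.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.inertia_)  # cost of the best of the 10 runs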

Dimensionality Reduction

Dimensionality reduction compresses the data (reduces the number of features) while retaining as much useful information as possible.

Principal Component Analysis (PCA)

PCA reduces the dimensionality of the data by transforming the coordinate system of the feature space (a translation plus a rotation).

The goal of PCA is to find a line represented by a vector (or a plane or subspace represented by a set of vectors), i.e., the principal components. When all samples are projected onto the principal components, we want the projection error to be as small as possible (so that as much information from the original data as possible is preserved).

In effect, PCA automatically combines the original input features into a smaller number of new features (the principal components), and these new features retain as much of the information in the original features as possible.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# training features
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# PCA
pca = PCA(n_components=2)
pca.fit(X)
transformed_X = pca.transform(X)
print(transformed_X)

# Percentage of variance explained by each of the selected components.
print(pca.explained_variance_ratio_)

# Principal axes in feature space, representing the directions of maximum variance in the data.
print(pca.components_)

# Same as pca.transform(X): center the data, then project it onto the principal axes.
print(np.dot(X - pca.mean_, pca.components_.T))

# plot each original point together with its projections onto the two components
for origin, transformed in zip(X, transformed_X):
    # projection onto the first component (red)
    prj0 = pca.mean_ + pca.components_[0] * transformed[0]
    plt.scatter(prj0[0], prj0[1], color='r')

    # projection onto the second component (cyan)
    prj1 = pca.mean_ + pca.components_[1] * transformed[1]
    plt.scatter(prj1[0], prj1[1], color='c')

    # original point (blue) and its coordinates in the transformed space (green)
    plt.scatter(origin[0], origin[1], color='b')
    plt.scatter(transformed[0], transformed[1], color='g')

plt.show()
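The example above keeps both components; for actual dimensionality reduction you keep fewer components than input features. A minimal sketch, reusing X from above (inverse_transform maps the reduced representation back into the original space):

from sklearn.decomposition import PCA

# keep only the first principal component: 2-D data -> 1-D representation
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)       # shape (6, 1)
print(pca.explained_variance_ratio_)   # fraction of variance retained

# map the 1-D representation back into the original 2-D space
# to see what information the compression discarded
X_restored = pca.inverse_transform(X_reduced)
print(X_restored)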

Below is an example that uses PCA to extract facial features (eigenfaces) and combines it with an SVM for face recognition:
Faces recognition example using PCA and SVM
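A condensed sketch of that approach, chaining PCA and an SVM in a pipeline; the parameter choices here (150 components, an RBF kernel) follow the linked scikit-learn example rather than anything derived in this note:

from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

# Labeled Faces in the Wild dataset (downloaded on first use)
lfw = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
X_train, X_test, y_train, y_test = train_test_split(
    lfw.data, lfw.target, random_state=42)

# project each face onto 150 "eigenfaces", then classify in that reduced space
model = make_pipeline(
    PCA(n_components=150, whiten=True, random_state=42),
    SVC(kernel='rbf', class_weight='balanced'))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))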

Handling Outliers

Possible causes of outliers

  • Data entry errors
  • Sensor malfunction
  • Genuine anomalous events

Outliers from the first two causes should be ignored or discarded, while outliers from the last cause sometimes deserve special attention, as in fraud detection.

Outlier detection and removal

  1. Train a model on the original training set
  2. Remove the samples with the largest residual errors from the training set
  3. Retrain on the cleaned training set
  4. The removal and retraining can be repeated several times
#!/usr/bin/python

import numpy

def outlierCleaner(pred, features, actual):
    """
    Clean away the 10% of points that have the largest
    residual errors (difference between the prediction
    and the actual).

    Return a list of tuples where
    each tuple is of the form (age, actual, error).
    """
    m = len(features) // 10  # number of points to discard
    data = [(features[i], actual[i], (actual[i] - pred[i]) ** 2)
            for i in range(len(features))]
    # sort by squared error, largest first, and drop the top 10%
    sorted_data = sorted(data, key=lambda item: item[2], reverse=True)
    cleaned_data = sorted_data[m:]
    return cleaned_data

# retrain on the cleaned points (`reg` is a regression model that was
# already fitted on the raw training set)
if len(cleaned_data) > 0:
    # refit your cleaned data!
    features, actual, errors = zip(*cleaned_data)
    features = numpy.reshape(numpy.array(features), (len(features), 1))
    actual = numpy.reshape(numpy.array(actual), (len(actual), 1))
    reg.fit(features, actual)
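A self-contained sketch of the full procedure on synthetic data (the dataset, the true slope of 6.25, and the injected outliers are all made up for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic age/net-worth data with a few extreme outliers injected
rng = np.random.RandomState(42)
ages = rng.uniform(20, 60, size=100).reshape(-1, 1)
net_worths = 6.25 * ages.ravel() + rng.normal(scale=40.0, size=100)
net_worths[:5] += 1000.0  # the outliers

# 1. train on the raw training set
reg = LinearRegression().fit(ages, net_worths)

# 2. remove the 10% of points with the largest residual errors
cleaned_data = outlierCleaner(reg.predict(ages), ages.ravel(), net_worths)

# 3. retrain on the cleaned training set
features, actual, errors = zip(*cleaned_data)
features = np.array(features).reshape(-1, 1)
reg.fit(features, np.array(actual))
print(reg.coef_)  # slope after cleaning, close to the true 6.25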

References

http://scikit-learn.org/stable/unsupervised_learning.html

http://scikit-learn.org/stable/modules/clustering.html#k-means

https://www.naftaliharris.com/blog/visualizing-k-means-clustering/

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans

http://scikit-learn.org/stable/modules/decomposition.html#pca

