KNN & Clustering
KNN (k-Nearest Neighbours) is a classification algorithm with no real training step: a query point is assigned the majority class among the k training points closest to it, and the training set is simply stored verbatim. Three distance metrics are supported: Euclidean (l2), Manhattan (l1), and Minkowski (lp).

```python
import numpy as np
from collections import Counter


class kNN():
    '''k-Nearest Neighbours'''

    # Initialise (p defaults to 2 so the Minkowski metric works out of the box)
    def __init__(self, k=3, metric='euclidean', p=2):
        self.k = k
        self.metric = metric
        self.p = p

    # Euclidean distance (l2 norm)
    def euclidean(self, v1, v2):
        return np.sqrt(np.sum((v1 - v2)**2))

    # Manhattan distance (l1 norm)
    def manhattan(self, v1, v2):
        return np.sum(np.abs(v1 - v2))

    # Minkowski distance (lp norm)
    def minkowski(self, v1, v2, p=2):
        return np.sum(np.abs(v1 - v2)**p)**(1 / p)

    # kNN has no conventional training phase that learns model parameters:
    # fit simply stores X_train (features) and y_train (labels) for use at
    # prediction time.
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    # Make predictions
    def predict(self, X_test):
        preds = []
        # Loop over rows in the test set
        for test_row in X_test:
            nearest_neighbours = self.get_neighbours(test_row)
            # Majority vote: the most frequent class among the k neighbours
            # (Counter avoids the scipy.stats.mode return-shape changes)
            majority = Counter(nearest_neighbours).most_common(1)[0][0]
            preds.append(majority)
        return np.array(preds)

    # Get nearest neighbours
    def get_neighbours(self, test_row):
        distances = list()
        # Calculate the distance to every point in X_train
        for (train_row, train_class) in zip(self.X_train, self.y_train):
            if self.metric == 'euclidean':
                dist = self.euclidean(train_row, test_row)
            elif self.metric == 'manhattan':
                dist = self.manhattan(train_row, test_row)
            elif self.metric == 'minkowski':
                dist = self.minkowski(train_row, test_row, self.p)
            else:
                raise NameError('Supported metrics are euclidean, manhattan and minkowski')
            distances.append((dist, train_class))
        # Sort by distance, then keep the classes of the k nearest points
        distances.sort(key=lambda x: x[0])
        neighbours = [distances[i][1] for i in range(self.k)]
        return neighbours
```

k-Means

k-Means clusters data around k centre points. Concretely: initialise k centroids (ideally as far apart from one another as possible); assign every sample $x_i$ to the cluster of whichever of the $k$ centroids is nearest; recompute each centroid as the mean of the coordinates of all points in its cluster; then reassign and recompute repeatedly until the centroids stop changing, or change by less than a chosen threshold ...
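The k-Means loop described above can be sketched as follows. This is a minimal illustration, not the notes' own implementation: the parameter names (`max_iters`, `tol`, `seed`) are made up for the sketch, and for simplicity it initialises centroids from random samples rather than the "as far apart as possible" heuristic mentioned above.

```python
import numpy as np


def k_means(X, k, max_iters=100, tol=1e-6, seed=0):
    """Cluster the rows of X into k groups by iteratively refining centroids."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct random training samples
    # (simpler than the far-apart initialisation described in the notes)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Assignment step: each sample joins the cluster of its
        # nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster's points;
        # an empty cluster keeps its previous centroid
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids move less than the threshold
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids
```

On two well-separated blobs, `k_means(X, 2)` assigns each blob its own label within a few iterations; note that plain k-Means is sensitive to initialisation, which is why smarter seeding schemes exist.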