KNN (k-Nearest Neighbours)#
KNN is a classification algorithm with no training phase: a query point is assigned a class based on the classes of the k training points closest to it; the training set itself is simply stored as-is.
Three distance metrics#
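The implementation below supports three metrics. For two vectors $v_1, v_2 \in \mathbb{R}^n$ they are

$$
d_{\mathrm{euclidean}}(v_1,v_2)=\sqrt{\sum_{i=1}^{n}(v_{1i}-v_{2i})^2},\qquad
d_{\mathrm{manhattan}}(v_1,v_2)=\sum_{i=1}^{n}\lvert v_{1i}-v_{2i}\rvert,\qquad
d_{\mathrm{minkowski}}(v_1,v_2)=\Big(\sum_{i=1}^{n}\lvert v_{1i}-v_{2i}\rvert^{p}\Big)^{1/p},
$$

so Euclidean and Manhattan distance are special cases of the Minkowski distance with $p=2$ and $p=1$ respectively.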
```python
import numpy as np
from scipy import stats


class kNN():
    '''k-Nearest Neighbours'''
    # Initialise
    def __init__(self, k=3, metric='euclidean', p=None):
        self.k = k
        self.metric = metric
        self.p = p  # p must be supplied when metric='minkowski'

    # Euclidean distance (l2 norm)
    def euclidean(self, v1, v2):
        return np.sqrt(np.sum((v1 - v2)**2))

    # Manhattan distance (l1 norm)
    def manhattan(self, v1, v2):
        return np.sum(np.abs(v1 - v2))

    # Minkowski distance (lp norm)
    def minkowski(self, v1, v2, p=2):
        return np.sum(np.abs(v1 - v2)**p)**(1/p)

    # kNN has no conventional "training" stage that learns model parameters.
    # fit simply stores the training features X_train and labels y_train
    # so they can be used at prediction time.
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    # Make predictions
    def predict(self, X_test):
        preds = []
        # Loop over rows in test set
        for test_row in X_test:
            nearest_neighbours = self.get_neighbours(test_row)
            # Majority vote: the most frequent class among the k neighbours
            # (keepdims=True keeps the array shape so the indexing works on recent SciPy)
            majority = stats.mode(nearest_neighbours, keepdims=True)[0][0]
            preds.append(majority)
        return np.array(preds)

    # Get nearest neighbours
    def get_neighbours(self, test_row):
        distances = list()
        # Calculate distance to all points in X_train
        for (train_row, train_class) in zip(self.X_train, self.y_train):
            if self.metric == 'euclidean':
                dist = self.euclidean(train_row, test_row)
            elif self.metric == 'manhattan':
                dist = self.manhattan(train_row, test_row)
            elif self.metric == 'minkowski':
                dist = self.minkowski(train_row, test_row, self.p)
            else:
                raise NameError('Supported metrics are euclidean, manhattan and minkowski')
            distances.append((dist, train_class))
        # Sort by distance
        distances.sort(key=lambda x: x[0])
        # Keep the classes of the k nearest neighbours
        neighbours = list()
        for i in range(self.k):
            neighbours.append(distances[i][1])
        return neighbours
```
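A minimal usage sketch of the kNN class above; the toy dataset and parameter choices are illustrative assumptions:

```python
import numpy as np

# Toy 2-D dataset: two well-separated blobs (illustrative only)
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
X_test = np.array([[0.5, 0.5], [4.8, 5.2]])

clf = kNN(k=3, metric='euclidean')
clf.fit(X_train, y_train)
print(clf.predict(X_test))  # expected: [0 1]
```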
k-Means#
Clustering is done by choosing k centre points (centroids). Concretely: initialise k centroids (ideally as far apart from each other as possible); assign every sample $x_i$ to the cluster of the centroid it is closest to; recompute each centroid as the mean of the coordinates of all points in its cluster; re-assign and recompute again; repeat until the centroids stop changing, or change by less than a chosen threshold. The two alternating steps are formalised below.
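In symbols, each iteration alternates an assignment step and an update step, where $C_j$ denotes the set of points currently assigned to centroid $\mu_j$:

$$
c_i=\arg\min_{j\in\{1,\dots,k\}}\lVert x_i-\mu_j\rVert,\qquad
\mu_j=\frac{1}{\lvert C_j\rvert}\sum_{x_i\in C_j}x_i .
$$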
The resulting clusters are spherical, which means k-Means handles data whose distribution is not roughly circular poorly: such a shape tends to be split across several clusters.
```python
import numpy as np


class kMeans():
    '''k-Means Clustering'''
    # Initialise
    def __init__(self, k=5, max_iter=100, tol=0.0001):
        self.k = k
        self.max_iter = max_iter
        self.tol = tol

    # Euclidean distance (l2 norm)
    def euclidean(self, v1, v2):
        return np.sqrt(np.sum((v1 - v2)**2))

    # Train model
    def fit(self, X_train):
        # Save train set for the update stage
        self.X_train = X_train
        # Sample k distinct points from X_train as the initial centroids
        idx = np.random.choice(len(X_train), size=self.k, replace=False)
        self.centroids = X_train[idx, :]
        self.clusters = np.zeros(len(X_train))
        # Iterate
        for i in range(self.max_iter):
            # Update clusters
            self.update_clusters()
            # Update centroids
            early_stop = self.update_centroids()
            # Early stopping if converged
            if early_stop:
                print(f'Early stopping occurred after {i} iterations')
                break

    # Calculate which cluster each point belongs to
    def update_clusters(self):
        for row_idx, train_row in enumerate(self.X_train):
            dist = []
            for i in range(self.k):
                dist.append(self.euclidean(train_row, self.centroids[i]))
            self.clusters[row_idx] = np.argmin(np.array(dist))

    # Calculate the centre of each cluster
    def update_centroids(self):
        # Recompute each centroid as the mean of the points assigned to it
        new_centroids = np.copy(self.centroids)
        for i in range(self.k):
            new_centroids[i] = np.mean(self.X_train[self.clusters == i], axis=0)
        # Check for convergence
        if np.linalg.norm(new_centroids - self.centroids) > self.tol:
            self.centroids = new_centroids
            return False
        else:
            self.centroids = new_centroids
            return True

    # Make predictions
    def predict(self, X_test):
        predictions = np.zeros(len(X_test))
        for row_idx, test_row in enumerate(X_test):
            dist = []
            for i in range(self.k):
                dist.append(self.euclidean(test_row, self.centroids[i]))
            predictions[row_idx] = np.argmin(np.array(dist))
        return predictions
```
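A minimal usage sketch of the kMeans class above; the synthetic blobs and parameter choices are illustrative assumptions:

```python
import numpy as np

# Three synthetic 2-D blobs (illustrative only)
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.5, (50, 2)) for c in ((0, 0), (5, 5), (0, 5))])

km = kMeans(k=3, max_iter=100, tol=1e-4)
km.fit(X)
print(km.centroids)       # roughly (0,0), (5,5), (0,5), up to label permutation
print(km.predict(X[:5]))  # cluster index of the first five points
```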
Gaussian Mixture Model (GMM)#
GMM does not share the drawback that the resulting clusters are spherical.
We will see that it produces clusters with Gaussian distributions, which are much more flexible and better reflect real-life data.
If we knew either the distribution parameters or which distribution every data point came from, we could recover the mixture. Here we know neither, so we estimate, for each point, the probability that it belongs to each of the k classes, and use these probabilities to update the distribution parameters.
In k-Means, we label each data point as coming from one of the k classes, whereas in EM we find the probability that each data point belongs to each of the k classes (the E-step). These probabilities are then used to update the Gaussian parameters (the first scenario above, called the M-step). Finally, we iterate to convergence, always alternating between the E-step and the M-step.
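Written out for a mixture of $k$ univariate Gaussians with mixing weights $\pi_j$, the two steps are

$$
\gamma_{ij}=\frac{\pi_j\,\mathcal{N}(x_i\mid\mu_j,\sigma_j^2)}{\sum_{l=1}^{k}\pi_l\,\mathcal{N}(x_i\mid\mu_l,\sigma_l^2)}
\quad\text{(E-step)},
$$

$$
\mu_j=\frac{\sum_i\gamma_{ij}x_i}{\sum_i\gamma_{ij}},\qquad
\sigma_j^2=\frac{\sum_i\gamma_{ij}(x_i-\mu_j)^2}{\sum_i\gamma_{ij}},\qquad
\pi_j=\frac{1}{N}\sum_{i=1}^{N}\gamma_{ij}
\quad\text{(M-step)}.
$$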
Take a Gaussian mixture with k=2: randomly initialise $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ and set $(\pi_1, \pi_2)=(0.5,0.5)$.
GMMs use the probabilities to estimate the means and variances as weighted averages.
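A minimal sketch of this EM loop for the k=2 one-dimensional case above; the synthetic data, initial values, and iteration count are illustrative assumptions, not part of the original text:

```python
import numpy as np
from scipy.stats import norm

# Synthetic 1-D data drawn from two Gaussians (illustrative only)
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1.5, 300)])

# Random initial means/variances, equal mixing weights as in the text
mu = rng.normal(size=2)
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(100):
    # E-step: responsibility of each component for each point
    dens = np.stack([pi[j] * norm.pdf(x, mu[j], np.sqrt(var[j])) for j in range(2)], axis=1)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update parameters as responsibility-weighted averages
    nk = gamma.sum(axis=0)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    var = (gamma * (x[:, None] - mu)**2).sum(axis=0) / nk
    pi = nk / len(x)

print(mu, var, pi)  # should approach the parameters used to generate the data
```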