KNN(k-Nearest Neighbours)

KNN is a classification algorithm that requires no training: a new data point is labelled with the majority class of the k training points closest to it, and the training set itself is simply stored verbatim.

Three distance metrics
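
For two vectors $v_1, v_2 \in \mathbb{R}^n$, the three metrics implemented in the code below are:

$$d_{\text{euclidean}}(v_1, v_2) = \sqrt{\sum_{i=1}^{n} (v_{1,i}-v_{2,i})^2}$$

$$d_{\text{manhattan}}(v_1, v_2) = \sum_{i=1}^{n} \lvert v_{1,i}-v_{2,i}\rvert$$

$$d_{\text{minkowski}}(v_1, v_2) = \left(\sum_{i=1}^{n} \lvert v_{1,i}-v_{2,i}\rvert^{p}\right)^{1/p}$$

Minkowski distance with $p=2$ reduces to the Euclidean distance, and with $p=1$ to the Manhattan distance.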


import numpy as np
from scipy import stats

class kNN():
    '''k-Nearest Neighbours'''
    # Initialise
    def __init__(self, k=3, metric='euclidean', p=None):
        self.k = k
        self.metric = metric
        self.p = p

    # Euclidean distance (l2 norm)
    def euclidean(self, v1, v2):
        return np.sqrt(np.sum((v1-v2)**2))

    # Manhattan distance (l1 norm)
    def manhattan(self, v1, v2):
        return np.sum(np.abs(v1-v2))

    # Minkowski distance (lp norm)
    def minkowski(self, v1, v2, p=2):
        return np.sum(np.abs(v1-v2)**p)**(1/p)
        
    # kNN has no conventional training phase and learns no model parameters.
    # fit simply stores the training features X_train and labels y_train for use at prediction time.
    # Store train set
    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train

    # Make predictions
    def predict(self, X_test):
        preds = []
        # Loop over rows in test set
        for test_row in X_test:
            nearest_neighbours = self.get_neighbours(test_row)
            majority = stats.mode(nearest_neighbours, keepdims=True).mode[0]  # most frequent class label among the k neighbours (keepdims requires SciPy >= 1.9)
            preds.append(majority)
        return np.array(preds)

    # Get nearest neighbours
    def get_neighbours(self, test_row):
        distances = list()

        # Calculate distance to all points in X_train
        for (train_row, train_class) in zip(self.X_train, self.y_train):
            if self.metric=='euclidean':
                dist = self.euclidean(train_row, test_row)
            elif self.metric=='manhattan':
                dist = self.manhattan(train_row, test_row)
            elif self.metric=='minkowski':
                dist = self.minkowski(train_row, test_row, self.p)
            else:
                raise NameError('Supported metrics are euclidean, manhattan and minkowski')
            distances.append((dist, train_class))

        # Sort distances
        distances.sort(key=lambda x: x[0])

        # Identify k nearest neighbours
        neighbours = list()
        for i in range(self.k):
            neighbours.append(distances[i][1])

        return neighbours
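
A minimal usage sketch of the class above. The iris data and train/test split come from scikit-learn and are purely illustrative, not part of the original post:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = kNN(k=5, metric='euclidean')
clf.fit(X_train, y_train)          # just stores the training set
preds = clf.predict(X_test)        # majority vote over the 5 nearest neighbours
print('Accuracy:', np.mean(preds == y_test))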

k-Means

k-Means clusters the data around k centroids. Concretely: initialise k centroids (ideally as far apart from one another as possible); assign every sample $x_i$ to a cluster by computing its distance to all k centroids and taking the nearest one; recompute each centroid as the mean of the coordinates of all points in its cluster; then re-assign and re-compute, repeating until the centroids stop changing, or change by less than a chosen tolerance.

The resulting clusters are spherical, which means k-Means handles non-spherical data distributions poorly: it tends to split such shapes across several clusters.

import numpy as np

class kMeans():
    '''k-Means Clustering'''
    # Initialise
    def __init__(self, k=5, max_iter=100, tol=0.0001):
        self.k= k 
        self.max_iter = max_iter
        self.tol = tol
    
    # Euclidean distance (l2 norm)
    def euclidean(self, v1, v2):
        return np.sqrt(np.sum((v1-v2)**2))
    
    # Train model
    def fit(self, X_train):
        # Save train set for update stage
        self.X_train = X_train
        
        # Sample k distinct points from X_train as initial centroids
        idx = np.random.choice(len(X_train), size=self.k, replace=False)
        self.centroids = X_train[idx, :]
        self.clusters = np.zeros(len(X_train))
        
        # Iterate
        for i in range(self.max_iter):
            # Update clusters
            self.update_clusters()

            # Update centroids
            early_stop = self.update_centroids()
            
            # Early stopping if converged
            if early_stop==True:
                print(f'Early stopping occurred after {i} iterations')
                break
    
    # Calculate which cluster each point belongs to
    def update_clusters(self):
        for row_idx, train_row in enumerate(self.X_train):
            dist = []
            for i in range(self.k):
                dist.append(self.euclidean(train_row, self.centroids[i]))
            self.clusters[row_idx] = np.argmin(np.array(dist))
        
    # Calculate center of each cluster
    def update_centroids(self):
        # Loop over k clusters
        new_centroids = np.copy(self.centroids)
        for i in range(self.k):
            members = self.X_train[self.clusters==i]
            # Guard against an empty cluster: keep its previous centroid instead of taking the mean of nothing
            if len(members) > 0:
                new_centroids[i] = np.mean(members, axis=0)
        
        # Check for convergence
        if np.linalg.norm(new_centroids-self.centroids)>self.tol:
            self.centroids = new_centroids
            return False
        else:
            self.centroids = new_centroids
            return True
    
    # Make predictions
    def predict(self, X_test):
        predictions = np.zeros(len(X_test))
        for row_idx, test_row in enumerate(X_test):
            dist = []
            for i in range(self.k):
                dist.append(self.euclidean(test_row, self.centroids[i]))
            predictions[row_idx] = np.argmin(np.array(dist))
        return predictions
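
A quick usage sketch of the class above, assuming scikit-learn's make_blobs for synthetic data (illustrative only):

from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

km = kMeans(k=4, max_iter=100, tol=0.0001)
km.fit(X)                 # alternates cluster assignment and centroid updates until convergence
labels = km.predict(X)    # nearest-centroid label for each point
print(km.centroids)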

Gaussian Mixture Model (GMM)

GMMs do not suffer from the spherical-cluster limitation described above.

We will see that they produce clusters with Gaussian distributions, which are much more flexible and more reflective of real-life data.

If we knew the distribution parameters, or if we knew which component every data point came from, we could recover the mixture either way. Here we know neither, so we estimate the probability that each point belongs to each of the k components and use those probabilities to update the distribution parameters.

In k-Means, we label each data point as coming from one of the k classes, whereas in EM we find the probability that each data point belongs to each of the k classes (the E-step). These probabilities are then used to update the Gaussian parameters (scenario 1, called the M-step). Finally, we iterate to convergence, always alternating between the E-step and the M-step.

Assume a Gaussian mixture with k=2: randomly initialise $\mu_1, \mu_2, \sigma_1^2, \sigma_2^2$ and set $(\pi_1, \pi_2)=(0.5,0.5)$.
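
With these parameters, each EM iteration first computes, in the E-step, the responsibility of component $k$ for point $x_i$, and then, in the M-step, re-estimates the parameters as responsibility-weighted averages:

$$\gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \sigma_k^2)}{\sum_{j=1}^{K} \pi_j\,\mathcal{N}(x_i \mid \mu_j, \sigma_j^2)}$$

$$\mu_k = \frac{\sum_i \gamma_{ik}\,x_i}{\sum_i \gamma_{ik}}, \qquad \sigma_k^2 = \frac{\sum_i \gamma_{ik}\,(x_i-\mu_k)^2}{\sum_i \gamma_{ik}}, \qquad \pi_k = \frac{1}{N}\sum_i \gamma_{ik}$$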

GMMs use the probabilities to estimate the means and variances as weighted averages.
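
A minimal sketch of these updates for a one-dimensional, two-component mixture. The synthetic data, random seed, and variable names below are my own illustration, not code from the original post:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1.5, 150)])  # synthetic 1-D data

# Random initialisation, as described above
mu = rng.choice(x, size=2)
var = np.array([1.0, 1.0])
pi = np.array([0.5, 0.5])

for _ in range(100):
    # E-step: responsibility of each component for each point
    dens = np.stack([pi[k] * norm.pdf(x, mu[k], np.sqrt(var[k])) for k in range(2)])  # shape (2, N)
    gamma = dens / dens.sum(axis=0)

    # M-step: responsibility-weighted means, variances and mixing weights
    Nk = gamma.sum(axis=1)
    mu = (gamma @ x) / Nk
    var = np.array([(gamma[k] * (x - mu[k])**2).sum() / Nk[k] for k in range(2)])
    pi = Nk / len(x)

print(mu, var, pi)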