Support Vector Machines (Afterword, Appendices, and Bibliography)


Translated by Wu Tianhui

Afterword

To conclude, I will quote Stuart Russell and Peter Norvig, who wrote:

You could say that SVMs are successful because of one key insight and one neat trick. (Russell & Norvig, 2010)

The key insight is the fact that some training examples matter more than others: the ones closest to the decision boundary, which we call support vectors. As a result, the optimal hyperplane generalizes better than other hyperplanes and can be constructed from the support vectors alone. One important detail is that finding this hyperplane requires solving a convex optimization problem.
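As a quick sanity check of that insight (not taken from this book), the following sketch assumes scikit-learn is available and verifies, on a small made-up dataset, that the decision function of a trained linear SVM can be rebuilt from its support vectors alone:

import numpy as np
from sklearn.svm import SVC

# Tiny, made-up linearly separable dataset (for illustration only).
X = np.array([[8, 7], [9, 6], [10, 10], [2, 7], [4, 4], [1, 3]], dtype=float)
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel='linear', C=10).fit(X, y)

# The decision function depends only on the support vectors:
# f(x) = sum over support vectors of (alpha_i * y_i) * <x_i, x> + b.
# scikit-learn stores alpha_i * y_i in dual_coef_ and b in intercept_.
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
f_manual = X @ w + clf.intercept_

print(np.allclose(f_manual, clf.decision_function(X)))  # True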

The neat trick is the kernel trick. It allows us to use SVMs on data that is not linearly separable; without it, the applicability of SVMs would be quite limited. We saw that this trick, although it may be hard to grasp at first, is in fact quite simple and can be reused in other learning algorithms.
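To recall why the trick is "neat", here is a minimal sketch (not from the book; plain NumPy, with made-up 2-D vectors) showing that the polynomial kernel K(a, b) = (a · b)² gives the same value as an explicit degree-2 feature mapping, while never computing that mapping itself:

import numpy as np

def polynomial_kernel(a, b):
    # Works directly on the original 2-D vectors.
    return np.dot(a, b) ** 2

def phi(v):
    # Explicit degree-2 feature map of a 2-D vector.
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

a = np.array([3.0, 6.0])
b = np.array([10.0, 10.0])

print(polynomial_kernel(a, b))   # 8100.0
print(np.dot(phi(a), phi(b)))    # 8100.0, same value computed in the feature space

The kernel returns the inner product in the transformed space while staying in the original space, which is exactly what makes it cheap.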

And that is it. If you have read this whole book, you now understand how SVMs work. Another interesting question is why they work; the field studying that question is called computational learning theory (SVMs in fact come from statistical learning theory). If you want to learn more about this subject, you can follow the outstanding course Learning from Data (Abu-Mostafa, 2012) (http://work.caltech.edu/telecourse.html), which provides a very good introduction to the topic.

You should also know that SVMs are not used only for classification. One-class SVM can be used for anomaly detection, and Support Vector Regression (SVR) can be used for regression analysis. They are not covered in this book in order to keep it succinct, but they are equally interesting topics. Now that you understand the basics of SVMs, you are better prepared to study these variants.
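Both variants are available in common libraries. The sketch below is not from this book; it assumes scikit-learn and uses small made-up datasets, simply to show what the one-class SVM and SVR APIs look like:

import numpy as np
from sklearn.svm import OneClassSVM, SVR

# Anomaly detection: train on "normal" points, then flag outliers (-1) vs. inliers (+1).
normal = np.random.RandomState(0).normal(loc=5.0, scale=0.5, size=(100, 2))
detector = OneClassSVM(kernel='rbf', gamma='scale', nu=0.05).fit(normal)
print(detector.predict([[5.1, 4.9], [0.0, 0.0]]))   # e.g. [ 1 -1]

# Regression: fit a noisy sine curve with Support Vector Regression.
X = np.linspace(0, 5, 50).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.RandomState(0).randn(50)
regressor = SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(X, y)
print(regressor.predict([[2.5]]))                   # close to sin(2.5) ≈ 0.6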

SVMs will not solve all of your problems, but I hope they are now a tool in your machine-learning toolbox, one that you understand and enjoy using.

Appendix A: Datasets

Linearly separable dataset

The code below defines the simple linearly separable dataset used in most chapters of this book. You can find the source for the other datasets used in this book in the Bitbucket repository (https://bitbucket.org/syncfusiontech/svm-succinctly).

Figure 60: The training set

Figure 61: The test set

The methods in Code Listing 52 are loaded when code imports the datasets module as shown in Code Listing 51.

The method get_training_examples returns the data shown in Figure 60, while the method get_test_examples returns the data shown in Figure 61.

Code Listing 51
from succinctly.datasets import *
Code Listing 52
import numpy as np


def get_training_examples():
    X1 = np.array([[8, 7], [4, 10], [9, 7], [7, 10],
                   [9, 6], [4, 8], [10, 10]])
    y1 = np.ones(len(X1))
    X2 = np.array([[2, 7], [8, 3], [7, 5], [4, 4],
                   [4, 6], [1, 3], [2, 5]])
    y2 = np.ones(len(X2)) * -1
    return X1, y1, X2, y2


def get_test_examples():
    X1 = np.array([[2, 9], [1, 10], [1, 11], [3, 9], [11, 5],
                   [10, 6], [10, 11], [7, 8], [8, 8], [4, 11],
                   [9, 9], [7, 7], [11, 7], [5, 8], [6, 10]])
    X2 = np.array([[11, 2], [11, 3], [1, 7], [5, 5], [6, 4],
                   [9, 4], [2, 6], [9, 3], [7, 4], [7, 2], [4, 5],
                   [3, 6], [1, 6], [2, 3], [1, 1], [4, 2], [4, 3]])
    y1 = np.ones(len(X1))
    y2 = np.ones(len(X2)) * -1
    return X1, y1, X2, y2

Code Listing 53 shows a typical use of this code. It uses the get_dataset method from Code Listing 54, which is loaded with the datasets package.

Code Listing 53
from succinctly.datasets import get_dataset, linearly_separable as ls
# Get the training examples of the linearly separable dataset.
X, y = get_dataset(ls.get_training_examples)
Code Listing 54
import numpy as np


def get_dataset(get_examples):
    X1, y1, X2, y2 = get_examples()
    X, y = get_dataset_for(X1, y1, X2, y2)
    return X, y


def get_dataset_for(X1, y1, X2, y2):
    X = np.vstack((X1, X2))
    y = np.hstack((y1, y2))
    return X, y


def get_generated_dataset(get_examples, n):
    X1, y1, X2, y2 = get_examples(n)
    X, y = get_dataset_for(X1, y1, X2, y2)
    return X, y

Appendix B: The SMO Algorithm

Code Listing 55
import numpy as np
from random import randrange


# Written from the pseudo-code in:
# http://luthuli.cs.uiuc.edu/~daf/courses/optimization/Papers/smoTR.pdf
class SmoAlgorithm:
    def __init__(self, X, y, C, tol, kernel, use_linear_optim):
        self.X = X
        self.y = y
        self.m, self.n = np.shape(self.X)
        self.alphas = np.zeros(self.m)

        self.kernel = kernel
        self.C = C
        self.tol = tol

        self.errors = np.zeros(self.m)
        self.eps = 1e-3  # epsilon

        self.b = 0

        self.w = np.zeros(self.n)
        self.use_linear_optim = use_linear_optim

    # Compute the SVM output for example i.
    # Note that Platt uses the convention w.x-b=0
    # while we have been using w.x+b in the book.
    def output(self, i):
        if self.use_linear_optim:
            # Equation 1
            return float(np.dot(self.w.T, self.X[i])) - self.b
        else:
            # Equation 10
            return np.sum([self.alphas[j] * self.y[j]
                           * self.kernel(self.X[j], self.X[i])
                           for j in range(self.m)]) - self.b

    # Try to solve the problem analytically.
    def take_step(self, i1, i2):
        if i1 == i2:
            return False

        a1 = self.alphas[i1]
        y1 = self.y[i1]
        X1 = self.X[i1]
        E1 = self.get_error(i1)

        s = y1 * self.y2

        # Compute the bounds of the new alpha2.
        if y1 != self.y2:
            # Equation 13
            L = max(0, self.a2 - a1)
            H = min(self.C, self.C + self.a2 - a1)
        else:
            # Equation 14
            L = max(0, self.a2 + a1 - self.C)
            H = min(self.C, self.a2 + a1)

        if L == H:
            return False

        k11 = self.kernel(X1, X1)
        k12 = self.kernel(X1, self.X[i2])
        k22 = self.kernel(self.X[i2], self.X[i2])

        # Compute the second derivative of the
        # objective function along the diagonal.
        # Equation 15
        eta = k11 + k22 - 2 * k12

        if eta > 0:
            # Equation 16
            a2_new = self.a2 + self.y2 * (E1 - self.E2) / eta

            # Clip the new alpha so that it stays at the end of the line.
            # Equation 17
            if a2_new < L:
                a2_new = L
            elif a2_new > H:
                a2_new = H
        else:
            # Under unusual circumstances, eta will not be positive.
            # Equation 19
            f1 = y1 * (E1 + self.b) - a1 * k11 - s * self.a2 * k12
            f2 = self.y2 * (self.E2 + self.b) - s * a1 * k12 \
                 - self.a2 * k22
            L1 = a1 + s * (self.a2 - L)
            H1 = a1 + s * (self.a2 - H)
            Lobj = L1 * f1 + L * f2 + 0.5 * (L1 ** 2) * k11 \
                   + 0.5 * (L ** 2) * k22 + s * L * L1 * k12
            Hobj = H1 * f1 + H * f2 + 0.5 * (H1 ** 2) * k11 \
                   + 0.5 * (H ** 2) * k22 + s * H * H1 * k12
            if Lobj < Hobj - self.eps:
                a2_new = L
            elif Lobj > Hobj + self.eps:
                a2_new = H
            else:
                a2_new = self.a2

        # If alpha2 did not change enough, the algorithm
        # returns without updating the multipliers.
        if abs(a2_new - self.a2) < self.eps * (a2_new + self.a2
                                               + self.eps):
            return False

        # Equation 18
        a1_new = a1 + s * (self.a2 - a2_new)

        new_b = self.compute_b(E1, a1, a1_new, a2_new, k11, k12, k22, y1)

        delta_b = new_b - self.b

        self.b = new_b

        # Equation 22
        if self.use_linear_optim:
            self.w = self.w + y1 * (a1_new - a1) * X1 \
                     + self.y2 * (a2_new - self.a2) * self.X2

        # Update the error cache using the new Lagrange multipliers.
        delta1 = y1 * (a1_new - a1)
        delta2 = self.y2 * (a2_new - self.a2)

        for i in range(self.m):
            if 0 < self.alphas[i] < self.C:
                self.errors[i] += delta1 * self.kernel(X1, self.X[i]) \
                                  + delta2 * self.kernel(self.X2, self.X[i]) \
                                  - delta_b

        self.errors[i1] = 0
        self.errors[i2] = 0

        self.alphas[i1] = a1_new
        self.alphas[i2] = a2_new

        return True

    def compute_b(self, E1, a1, a1_new, a2_new, k11, k12, k22, y1):
        # Equation 20
        b1 = E1 + y1 * (a1_new - a1) * k11 + \
             self.y2 * (a2_new - self.a2) * k12 + self.b

        # Equation 21
        b2 = self.E2 + y1 * (a1_new - a1) * k12 + \
             self.y2 * (a2_new - self.a2) * k22 + self.b

        if (0 < a1_new) and (self.C > a1_new):
            new_b = b1
        elif (0 < a2_new) and (self.C > a2_new):
            new_b = b2
        else:
            new_b = (b1 + b2) / 2.0

        return new_b

    def get_error(self, i1):
        if 0 < self.alphas[i1] < self.C:
            return self.errors[i1]
        else:
            return self.output(i1) - self.y[i1]

    def second_heuristic(self, non_bound_indices):
        i1 = -1
        if len(non_bound_indices) > 1:
            max_step = 0
            for j in non_bound_indices:
                E1 = self.errors[j] - self.y[j]
                step = abs(E1 - self.E2)  # approximation
                if step > max_step:
                    max_step = step
                    i1 = j
        return i1

    def examine_example(self, i2):
        self.y2 = self.y[i2]
        self.a2 = self.alphas[i2]
        self.X2 = self.X[i2]
        self.E2 = self.get_error(i2)

        r2 = self.E2 * self.y2

        if not ((r2 < -self.tol and self.a2 < self.C) or
                (r2 > self.tol and self.a2 > 0)):
            # The KKT conditions are met, SMO looks at another example.
            return 0

        # Second heuristic A: choose the Lagrange multiplier which
        # maximizes the absolute error.
        non_bound_idx = list(self.get_non_bound_indexes())
        i1 = self.second_heuristic(non_bound_idx)

        if i1 >= 0 and self.take_step(i1, i2):
            return 1

        # Second heuristic B: look for examples making positive
        # progress by looping over all non-zero and non-C alphas,
        # starting at a random point.
        if len(non_bound_idx) > 0:
            rand_i = randrange(len(non_bound_idx))
            for i1 in non_bound_idx[rand_i:] + non_bound_idx[:rand_i]:
                if self.take_step(i1, i2):
                    return 1

        # Second heuristic C: look for examples making positive progress
        # by looping over all possible examples, starting at a random
        # point.
        rand_i = randrange(self.m)
        all_indices = list(range(self.m))
        for i1 in all_indices[rand_i:] + all_indices[:rand_i]:
            if self.take_step(i1, i2):
                return 1

        # Extremely degenerate circumstances, SMO skips the first example.
        return 0

    def error(self, i2):
        return self.output(i2) - self.y2

    def get_non_bound_indexes(self):
        return np.where(np.logical_and(self.alphas > 0,
                                       self.alphas < self.C))[0]

    # First heuristic: loop over examples where alpha is not 0 and not C;
    # they are the most likely to violate the KKT conditions
    # (the non-bound subset).
    def first_heuristic(self):
        num_changed = 0
        non_bound_idx = self.get_non_bound_indexes()
        for i in non_bound_idx:
            num_changed += self.examine_example(i)
        return num_changed
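Note that Code Listing 56 calls smo.main_routine(), a method that does not appear in the listing above. The following is a reconstruction of that outer loop based on Platt's pseudo-code (the paper referenced at the top of Code Listing 55), to be added inside the SmoAlgorithm class; treat it as a sketch rather than the author's exact code:

    def main_routine(self):
        num_changed = 0
        examine_all = True

        while num_changed > 0 or examine_all:
            num_changed = 0

            if examine_all:
                # Loop over all training examples.
                for i in range(self.m):
                    num_changed += self.examine_example(i)
            else:
                # Loop only over the non-bound examples (first heuristic).
                num_changed += self.first_heuristic()

            if examine_all:
                examine_all = False
            elif num_changed == 0:
                examine_all = True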

Code Listing 56 demonstrates how to instantiate a SmoAlgorithm object, run the algorithm, and print the result.

Code Listing 56
import numpy as np
from random import seed
from succinctly.datasets import linearly_separable, get_dataset
from succinctly.algorithms.smo_algorithm import SmoAlgorithm


def linear_kernel(x1, x2):
    return np.dot(x1, x2)


def compute_w(multipliers, X, y):
    return np.sum([multipliers[i] * y[i] * X[i] for i in range(len(y))],
                  axis=0)


if __name__ == '__main__':
    seed(5)  # to have reproducible results

    X_data, y_data = get_dataset(linearly_separable.get_training_examples)

    smo = SmoAlgorithm(X_data, y_data, C=10, tol=0.001,
                       kernel=linear_kernel, use_linear_optim=True)

    smo.main_routine()

    w = compute_w(smo.alphas, X_data, y_data)

    print('w = {}'.format(w))

    # -smo.b because Platt uses the convention w.x-b=0
    print('b = {}'.format(-smo.b))

    # w = [0.4443664 1.1105648]
    # b = -9.66268641132
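With w and b available, classification follows the book's convention h(x) = sign(w · x + b), with b = -smo.b. As a hypothetical follow-up (not part of the original listing), the lines below could be appended to the __main__ block of Code Listing 56 to label the test set from Appendix A:

    # Evaluate the hyperplane on the test set from Appendix A,
    # using the book's convention h(x) = sign(w.x + b) with b = -smo.b.
    X_test, y_test = get_dataset(linearly_separable.get_test_examples)
    predictions = np.sign(np.dot(X_test, w) + (-smo.b))
    print('accuracy = {}'.format(np.mean(predictions == y_test)))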

Bibliography

Abu-Mostafa, Y. S. (2012). Learning From Data. AMLBook.

Biernat, E., & Lutz, M. (2016). Data science: fondamentaux et études de cas. Eyrolles.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge University Press.

Burges, C. J. (1998). A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 121-167.

Crammer, K., & Singer, Y. (2001). On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines. Journal of Machine Learning Research 2.

Cristianini, N., & Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge University Press.

Dogan, U., Glasmachers, T., & Igel, C. (2011). Fast Training of Multi-Class Support Vector Machines.

El Ghaoui, L. (2015). Optimization Models and Applications. Retrieved from http://livebooklabs.com/keeppies/c5a5868ce26b8125

Gershwin, S. B. (2010). KKT Examples. Retrieved from MIT Mechanical Engineering Course: http://ocw.mit.edu/courses/mechanical-engineering/2-854-introduction-to-manufacturing-systems-fall-2010/lecture-notes/MIT2_854F10_kkt_ex.pdf

Gretton, A. (2016, 03 05). Lecture 9: Support Vector Machines. Retrieved from http://www.gatsby.ucl.ac.uk/~gretton/coursefiles/Slides5A.pdf

Han, X., & Berg, A. C. (2012). DCMSVM: Distributed Parallel Training For Single-Machine Multiclass Classifiers.

Hsu, C.-W., & Lin, C.-J. (2002). A Comparison of Methods for Multi-class Support Vector Machines. IEEE Transactions on Neural Networks.

Hsu, C.-W., Chang, C.-C., & Lin, C.-J. (2016, 10 02). A Practical Guide to Support Vector Classification. Retrieved from LIBSVM website: http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf

Ng, A. (n.d.). CS229 Lecture notes - Part V Support Vector Machines. Retrieved from http://cs229.stanford.edu/notes/cs229-notes3.pdf

Osuna, E., Freund, R., & Girosi, F. (1997). An Improved Training Algorithm for Support Vector Machines. Proceedings of IEEE NNSP'97.

Platt, J. C. (1998). Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. Microsoft Research.

Platt, J. C., Cristianini, N., & Shawe-Taylor, J. (2000). Large margin DAGs for multiclass classification. MIT Press.

Rojas, R. (1996). Neural Networks: A Systematic Introduction. Springer.

Russell, S., & Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Pearson.

Tyson Smith, B. (2004). Lagrange Multipliers Tutorial in the Context of Support Vector Machines. Newfoundland.

Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer.

Vapnik, V. N. (1998). Statistical Learning Theory. Wiley.

Weston, J., & Watkins, C. (1999). Support Vector Machines for Multi-Class Pattern Recognition. Proceedings of the Seventh European Symposium on Artificial Neural Networks.
