Support Vector Machines

The support vector machine (SVM) is one of the mainstream techniques in machine learning, and many researchers regard it as one of the best off-the-shelf algorithms for supervised learning. Its basic idea is to separate the classes as widely as possible while still classifying the training samples correctly. To overcome the limitation of linear inseparability, the SVM uses the kernel trick to map the data from the original feature space into a higher-dimensional feature space in which the classes become separable. The method rests on a rigorous mathematical foundation and generalizes well in practice, which makes it one of the most popular methods in supervised learning.

Basic Principles and Algorithms

Primal and Dual Problems

The primal problem of the SVM is to maximize the margin between the hyperplane and the support vectors of the two classes, subject to the constraint that every sample is classified correctly. Mathematically, the problem can be written as:
\[
\min_{\mathbf{w},b} \frac{1}{2} \|\mathbf{w}\|^2 \\
s.t.\ y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1, \quad i=1, 2, ..., m.
\]
To convert this constrained problem into an unconstrained one, introduce Lagrange multipliers \(\alpha_i \geq 0, i=1,2,...,m\); the corresponding Lagrangian is:
\[
L(\mathbf{w}, b, \mathbf{\alpha}) = \frac{1}{2} \| \mathbf{w} \|^2 - \sum_{i=1}^{m} \alpha_i \left( y_i (\mathbf{w}^T \mathbf{x}_i + b) - 1 \right)
\]
Setting the partial derivatives with respect to \(\mathbf{w}\) and \(b\) to zero gives:
\[
\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i \\
0 = \sum_{i=1}^{m} \alpha_i y_i
\]
Substituting the expression for \(\mathbf{w}\) back into the Lagrangian yields:
\[
L(\mathbf{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j
\]
so the primal problem can be transformed into the following dual problem:
\[
\max_{\mathbf{\alpha}} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\
s.t. \ \sum_{i=1}^m \alpha_i y_i = 0, \\
\alpha_i \geq 0, \quad i=1, 2, ..., m.
\]
Note that if the primal problem attains its minimum at \(\mathbf{w}^*\), then, since the second term of the Lagrangian is non-positive at any feasible point, the optimal \(\alpha^*\) is found at a maximum of the Lagrangian. After solving for \(\mathbf{\alpha}\), \(\mathbf{w}\) and \(b\) can be recovered, giving the model:
\[
f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b.
\]
In addition, the \(\alpha_i\) obtained from the dual problem must satisfy the KKT (Karush-Kuhn-Tucker) conditions:
\[
\alpha_i \geq 0, \quad i=1, 2, ..., m \\
y_i f(\mathbf{x}_i) - 1 \geq 0 \\
\alpha_i (y_i f(\mathbf{x}_i) - 1) = 0
\]
The first KKT condition is the requirement on the Lagrange multipliers themselves, the second is the constraint of the primal problem, and the third (complementary slackness) makes the dual objective coincide with the primal objective at the optimum.
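As a minimal illustration of the relation \(\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i\), the NumPy sketch below recovers \(\mathbf{w}\) and \(b\) from a given set of dual variables. The function name `recover_model` and its inputs `alphas`, `X`, `y` are hypothetical (not part of the reference code later in this article), and averaging \(b\) over the support vectors is just one common, numerically stable choice.

import numpy as np

def recover_model(alphas, X, y):
    """Recover (w, b) from dual variables for a hard-margin SVM.
    alphas: (m,) dual variables, X: (m, n) samples, y: (m,) labels in {-1, +1}."""
    w = (alphas * y) @ X                 # w = sum_i alpha_i * y_i * x_i
    sv = np.where(alphas > 1e-8)[0]      # indices of support vectors (alpha_i > 0)
    # For a support vector x_s: y_s (w^T x_s + b) = 1  =>  b = y_s - w^T x_s
    b = np.mean(y[sv] - X[sv] @ w)       # average over support vectors for stability
    return w, b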

Soft Margin and Regularization

The discussion so far assumes the training samples are linearly separable. In practice, however, it is often hard to find a hyperplane (or even a kernel) that separates the classes completely; and even if such a hyperplane or kernel is found that separates the training set in feature space, it is hard to rule out that this is simply the result of overfitting. One way to alleviate this problem is to introduce a "soft margin", allowing the SVM to make mistakes on some samples. Mathematically, the objective becomes:
\[
\min_{\mathbf{w},b,\xi_i} \frac{1}{2} \|\mathbf{w}\|^2 + C \sum_{i=1}^{m} \xi_i \\
s.t.\ y_i(\mathbf{w}^T \mathbf{x}_i + b) + \xi_i - 1 \geq 0, \quad i=1, 2, ..., m, \\
\xi_i \geq 0, \quad i=1, 2, ..., m.
\]
This is exactly what one obtains by replacing the 0/1 loss with the hinge loss.
Introducing Lagrange multipliers \(\alpha_i \geq 0, \beta_i \geq 0\), the Lagrangian is:
\[
L(\mathbf{w}, b, \xi_i, \mathbf{\alpha}, \mathbf{\beta}) = \frac{1}{2} \| \mathbf{w} \|^2 + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left( y_i (\mathbf{w}^T \mathbf{x}_i + b) + \xi_i - 1 \right) - \sum_{i=1}^{m} \beta_i \xi_i \\
= \frac{1}{2} \| \mathbf{w} \|^2 + \sum_{i=1}^{m} (C - \alpha_i - \beta_i) \xi_i - \left( \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i^T \right) \mathbf{w} - \left( \sum_{i=1}^m \alpha_i y_i \right) b + \sum_{i=1}^m \alpha_i
\]
Setting the partial derivatives with respect to \(\mathbf{w}\), \(b\), and \(\xi_i\) to zero gives:
\[
\mathbf{w} = \sum_{i=1}^{m} \alpha_i y_i \mathbf{x}_i \\
0 = \sum_{i=1}^{m} \alpha_i y_i \\
\alpha_i = C - \beta_i
\]
Substituting these back into the Lagrangian, we obtain:
\[
L(\mathbf{\alpha}) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j
\]
so the primal problem can be transformed into the following dual problem:
\[
\max_{\mathbf{\alpha}} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \\
s.t. \ \sum_{i=1}^m \alpha_i y_i = 0, \\
0 \leq \alpha_i \leq C, \quad i=1, 2, ..., m.
\]
After solving for \(\mathbf{\alpha}\), \(\mathbf{w}\) and \(b\) can be recovered, giving the model:
\[
f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = \sum_{i=1}^m \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b.
\]
In addition, the \(\alpha_i, \beta_i, i=1, 2, ..., m\) obtained from the dual problem must satisfy the KKT (Karush-Kuhn-Tucker) conditions:
\[
0 \leq \alpha_i \leq C \\
\beta_i \geq 0 \\
y_i(\mathbf{w}^T \mathbf{x}_i + b) + \xi_i - 1 \geq 0 \\
\xi_i \geq 0 \\
\alpha_i \left( y_i (\mathbf{w}^T \mathbf{x}_i + b) + \xi_i - 1 \right) = 0 \\
\beta_i \xi_i = 0
\]
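To make the connection to the hinge loss concrete, the short NumPy sketch below evaluates the soft-margin objective \(\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b))\), in which \(\xi_i\) is replaced by the hinge loss. The function name and the inputs `w`, `b`, `X`, `y`, `C` are hypothetical and not produced elsewhere in this article.

import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Soft-margin SVM objective with xi_i replaced by the hinge loss.
    w: (n,), b: scalar, X: (m, n), y: (m,) labels in {-1, +1}, C: penalty."""
    margins = y * (X @ w + b)                # y_i (w^T x_i + b)
    hinge = np.maximum(0.0, 1.0 - margins)   # xi_i = max(0, 1 - y_i (w^T x_i + b))
    return 0.5 * np.dot(w, w) + C * hinge.sum()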

The SMO Algorithm

Whether the margin is hard or soft, the dual problem is a quadratic program and can be solved with a general-purpose QP solver. However, the size of this problem grows with the number of training samples, which makes generic solvers expensive in practice. An efficient alternative is the SMO (Sequential Minimal Optimization) algorithm. The basic idea of SMO is to fix all parameters except one \(\alpha_i\) and optimize over that single variable. But because of the constraint \(\sum_{i=1}^m \alpha_i y_i = 0\), fixing all parameters other than \(\alpha_i\) already determines \(\alpha_i\), so SMO instead selects two variables \(\alpha_i, \alpha_j\) at a time and fixes the rest. After the parameters are initialized, SMO repeats the following two steps until convergence:
- select a pair of variables \(\alpha_i, \alpha_j\) to update;
- fix all parameters other than \(\alpha_i, \alpha_j\) and update them:
\[
\alpha_j^{new} = \alpha_j^{old} + \frac{y_j (E_j^{old} - E_i^{old})}{\eta} \\
\alpha_i^{new} = \alpha_i^{old} - \frac{s\, y_j (E_j^{old} - E_i^{old})}{\eta}
\]
where
\[
\eta = 2K_{ij} - K_{ii} - K_{jj} \\
s = y_i y_j \\
K_{ii} = \mathbf{x}_i^T \mathbf{x}_i \\
K_{ij} = \mathbf{x}_i^T \mathbf{x}_j \\
K_{jj} = \mathbf{x}_j^T \mathbf{x}_j
\]
After updating \(\alpha_j\) and before updating \(\alpha_i\), \(\alpha_j\) must be clipped to its feasible range:
- when \(s=1\), the lower bound is \(\max(0, \alpha_i+\alpha_j-C)\) and the upper bound is \(\min(C, \alpha_i+\alpha_j)\);
- when \(s=-1\), the lower bound is \(\max(0, \alpha_j-\alpha_i)\) and the upper bound is \(\min(C, C+\alpha_j-\alpha_i)\).
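The update formulas and clipping bounds above can be distilled into a single, self-contained pair update. The sketch below is only illustrative: the function name `smo_pair_update` and its argument list are not part of the reference code that follows, but it performs the same computation as the inner loop of `smoSimple` further down (for the linear kernel).

import numpy as np

def smo_pair_update(alpha_i, alpha_j, y_i, y_j, E_i, E_j, x_i, x_j, C):
    """One SMO pair update, following the formulas above."""
    K_ii, K_ij, K_jj = x_i @ x_i, x_i @ x_j, x_j @ x_j
    eta = 2.0 * K_ij - K_ii - K_jj
    if eta >= 0:                               # skip degenerate pairs
        return alpha_i, alpha_j
    s = y_i * y_j
    # Feasible range of alpha_j from 0 <= alpha <= C and the equality constraint
    if s == 1:
        L, H = max(0.0, alpha_i + alpha_j - C), min(C, alpha_i + alpha_j)
    else:
        L, H = max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
    alpha_j_new = alpha_j + y_j * (E_j - E_i) / eta
    alpha_j_new = np.clip(alpha_j_new, L, H)   # clip alpha_j before updating alpha_i
    alpha_i_new = alpha_i + s * (alpha_j - alpha_j_new)
    return alpha_i_new, alpha_j_new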

from numpy import *

def loadDataSet(fileName):
    """
    Open the data file.
    It has 3 columns: the first 2 are x1, x2 and the last one is the label.
    The number of rows is the number of training samples.
    """
    dataMat = []; labelMat = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = line.strip().split('\t')
        dataMat.append([float(lineArr[0]), float(lineArr[1])])
        labelMat.append(float(lineArr[2]))
    return dataMat, labelMat

def selectJrand(i, m):
    """
    Randomly pick an index in [0, m) that is different from i.
    """
    j = i
    while (j == i):
        j = int(random.uniform(0, m))
    return j

def clipAlpha(aj, H, L):
    """
    Clip aj to the range [L, H].
    """
    if aj > H:
        aj = H
    if aj < L:
        aj = L
    return aj
def smoSimple(dataMatIn, classLabels, C, toler, maxIter):
    """
    dataMatIn: (m, n) features of the training data
    classLabels: (m,) labels of the training data
    """
    dataMatrix = mat(dataMatIn)                # convert to numpy matrix
    labelMat = mat(classLabels).transpose()    # convert to numpy matrix of shape (m, 1)
    b = 0; m, n = shape(dataMatrix)
    alphas = mat(zeros((m, 1)))
    iter = 0

    while (iter < maxIter):
        alphaPairsChanged = 0
        for i in range(m):
            fXi = float(multiply(alphas, labelMat).T *
                        (dataMatrix * dataMatrix[i, :].T)) + b   # model prediction with the current alphas
            Ei = fXi - float(labelMat[i])                        # prediction error
            if ((labelMat[i] * Ei < -toler) and (alphas[i] < C)) or \
               ((labelMat[i] * Ei > toler) and (alphas[i] > 0)):
                j = selectJrand(i, m)
                fXj = float(multiply(alphas, labelMat).T *
                            (dataMatrix * dataMatrix[j, :].T)) + b
                Ej = fXj - float(labelMat[j])
                alphaIold = alphas[i].copy()
                alphaJold = alphas[j].copy()

                # set the bounds of \alpha_j
                if (labelMat[i] != labelMat[j]):
                    L = max(0, alphas[j] - alphas[i])
                    H = min(C, C + alphas[j] - alphas[i])
                else:
                    L = max(0, alphas[j] + alphas[i] - C)
                    H = min(C, alphas[j] + alphas[i])
                if L == H:
                    #print("L==H")
                    continue

                # compute \eta
                eta = 2.0 * dataMatrix[i, :] * dataMatrix[j, :].T - \
                      dataMatrix[i, :] * dataMatrix[i, :].T - \
                      dataMatrix[j, :] * dataMatrix[j, :].T
                if eta >= 0:
                    #print("eta>=0")
                    continue

                # update \alpha_j
                alphas[j] -= labelMat[j] * (Ei - Ej) / eta
                # clip \alpha_j
                alphas[j] = clipAlpha(alphas[j], H, L)
                if (abs(alphas[j] - alphaJold) < 0.00001):
                    #print("j not moving enough")
                    continue

                # update \alpha_i
                alphas[i] += labelMat[j] * labelMat[i] * (alphaJold - alphas[j])

                # update b
                b1 = b - Ei - labelMat[i] * (alphas[i] - alphaIold) * \
                     dataMatrix[i, :] * dataMatrix[i, :].T - \
                     labelMat[j] * (alphas[j] - alphaJold) * \
                     dataMatrix[i, :] * dataMatrix[j, :].T
                b2 = b - Ej - labelMat[i] * (alphas[i] - alphaIold) * \
                     dataMatrix[i, :] * dataMatrix[j, :].T - \
                     labelMat[j] * (alphas[j] - alphaJold) * \
                     dataMatrix[j, :] * dataMatrix[j, :].T
                if (alphas[i] > 0) and (alphas[i] < C):
                    b = b1
                elif (alphas[j] > 0) and (alphas[j] < C):
                    b = b2
                else:
                    b = (b1 + b2) / 2.0

                # count the number of alpha pairs updated
                alphaPairsChanged += 1
                #print("iter: %d i:%d, pairs changed %d" % \
                #      (iter, i, alphaPairsChanged))
        if (alphaPairsChanged == 0):
            iter += 1
        else:
            iter = 0
        #print("iteration number: %d" % iter)

    return b, alphas
dataArr, labelArr = loadDataSet('testSet2.txt')
b, alphas = smoSimple(dataArr, labelArr, 0.6, 0.001, 40)
print(b, alphas[alphas>0])  # print b and the alphas greater than 0
[[-3.76119735]] [[ 0.13268982  0.03391035  0.18867574  0.00656234  0.36183825]]
# print the support vectors
for i in range(100):
    if alphas[i] > 0.0: print(dataArr[i], labelArr[i])
[4.658191, 3.507396] -1.0
[3.223038, -0.552392] -1.0
[3.457096, -0.082216] -1.0
[2.893743, -1.643468] -1.0
[6.080573, 0.418886] 1.0
class optStruct:
    """
    A helper structure that holds the parameters used by the SMO algorithm.
    """
    def __init__(self, dataMatIn, classLabels, C, toler):
        self.X = dataMatIn
        self.labelMat = classLabels
        self.C = C
        self.tol = toler
        self.m = shape(dataMatIn)[0]
        self.alphas = mat(zeros((self.m, 1)))
        self.b = 0
        self.eCache = mat(zeros((self.m, 2)))  # error cache: first column is a validity flag, second column is the actual E value

def calcEk(oS, k):
    fXk = float(multiply(oS.alphas, oS.labelMat).T * \
                (oS.X * oS.X[k, :].T)) + oS.b
    Ek = fXk - float(oS.labelMat[k])
    return Ek

def selectJ(i, oS, Ei):
    """
    Select the index j (and its Ej) whose error differs most from Ei.
    i: input index
    oS: helper structure
    Ei: error of sample i
    """
    maxK = -1; maxDeltaE = 0; Ej = 0
    oS.eCache[i] = [1, Ei]  # mark the input error as valid in the cache and store it
    validEcacheList = nonzero(oS.eCache[:, 0].A)[0]  # indices of the valid (non-zero) entries

    # choose the index whose E value differs most from Ei
    if (len(validEcacheList)) > 1:
        for k in validEcacheList:
            if k == i: continue
            Ek = calcEk(oS, k)
            deltaE = abs(Ei - Ek)
            if (deltaE > maxDeltaE):
                maxK = k; maxDeltaE = deltaE; Ej = Ek
        return maxK, Ej

    # first pass: pick a random alpha
    else:
        j = selectJrand(i, oS.m)
        Ej = calcEk(oS, j)
        return j, Ej

def updateEk(oS, k):
    """
    Compute the error for index k and store it in the structure.
    """
    Ek = calcEk(oS, k)
    oS.eCache[k] = [1, Ek]
def innerL(i, oS):
    """
    Update one pair of alphas and b.
    """
    Ei = calcEk(oS, i)
    if ((oS.labelMat[i] * Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or \
       ((oS.labelMat[i] * Ei > oS.tol) and (oS.alphas[i] > 0)):
        j, Ej = selectJ(i, oS, Ei)
        alphaIold = oS.alphas[i].copy(); alphaJold = oS.alphas[j].copy()
        # set the bounds of \alpha_j
        if (oS.labelMat[i] != oS.labelMat[j]):
            L = max(0, oS.alphas[j] - oS.alphas[i])
            H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
        else:
            L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
            H = min(oS.C, oS.alphas[j] + oS.alphas[i])
        if L == H:
            #print("L==H")
            return 0

        # compute \eta
        eta = 2.0 * oS.X[i, :] * oS.X[j, :].T - \
              oS.X[i, :] * oS.X[i, :].T - \
              oS.X[j, :] * oS.X[j, :].T
        if eta >= 0:
            #print("eta>=0")
            return 0

        # update \alpha_j
        oS.alphas[j] -= oS.labelMat[j] * (Ei - Ej) / eta
        # clip \alpha_j
        oS.alphas[j] = clipAlpha(oS.alphas[j], H, L)
        updateEk(oS, j)  # keep the error cache consistent after changing \alpha_j
        if (abs(oS.alphas[j] - alphaJold) < 0.00001):
            #print("j not moving enough")
            return 0

        # update \alpha_i
        oS.alphas[i] += oS.labelMat[j] * oS.labelMat[i] * (alphaJold - oS.alphas[j])

        updateEk(oS, i)

        # update b
        b1 = oS.b - Ei - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
             oS.X[i, :] * oS.X[i, :].T - \
             oS.labelMat[j] * (oS.alphas[j] - alphaJold) * \
             oS.X[i, :] * oS.X[j, :].T
        b2 = oS.b - Ej - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * \
             oS.X[i, :] * oS.X[j, :].T - \
             oS.labelMat[j] * (oS.alphas[j] - alphaJold) * \
             oS.X[j, :] * oS.X[j, :].T

        if (oS.alphas[i] > 0) and (oS.alphas[i] < oS.C):
            oS.b = b1
        elif (oS.alphas[j] > 0) and (oS.alphas[j] < oS.C):
            oS.b = b2
        else:
            oS.b = (b1 + b2) / 2.0
        return 1

    else:
        return 0

The differences from the earlier function are: 1. it uses the oS structure; 2. it uses selectJ instead of selectJrand to choose the second alpha; 3. it updates the error cache (eCache) after the alpha values change.

def smoP(dataMatIn, classLabels, C, toler, maxIter, kTup=('lin', 0)):
    oS = optStruct(mat(dataMatIn), mat(classLabels).transpose(), C, toler)
    iter = 0
    entireSet = True; alphaPairsChanged = 0
    while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):
        alphaPairsChanged = 0
        if entireSet:
            for i in range(oS.m):
                alphaPairsChanged += innerL(i, oS)
                #print("fullSet, iter: %d i: %d, pairs changed %d" %\
                #      (iter, i, alphaPairsChanged))
            iter += 1
        else:
            nonBoundIs = nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]
            for i in nonBoundIs:
                alphaPairsChanged += innerL(i, oS)
                #print("non-bound, iter: %d i: %d, pairs changed %d" % \
                #      (iter, i, alphaPairsChanged))
            iter += 1
        if entireSet: entireSet = False
        elif (alphaPairsChanged == 0): entireSet = True
        #print("iteration number: %d" % iter)
    return oS.b, oS.alphas
dataArr, labelArr = loadDataSet('testSet2.txt')
b, alphas = smoP(dataArr, labelArr, 0.6, 0.001, 40)
def calWs(alphas, dataArr, classLabels):
    """
    Compute w from the alphas. Note that since most alphas are 0,
    only the support vectors contribute to the result.
    """
    X = mat(dataArr); labelMat = mat(classLabels).transpose()
    m, n = shape(X)
    w = zeros((n, 1))
    for i in range(m):
        w += multiply(alphas[i] * labelMat[i], X[i, :].T)
    return w
ws = calWs(alphas, dataArr, labelArr)
ws
array([[ 0.53103017],
       [-0.19632991]])
# predict the class: > 0 means class 1, < 0 means class -1
datMat = mat(dataArr)
datMat[2]*mat(ws)+b
matrix([[ 1.50030532]])
# the true class is:
labelArr[2]
1.0

Kernel Functions

The preceding discussion assumes the training samples are linearly separable, i.e., that a hyperplane exists that classifies them correctly. In practice such a hyperplane often does not exist; the XOR problem, for example, is not linearly separable. One solution is to map the original space into a higher-dimensional feature space. Let \(\phi(\mathbf{x})\) denote the feature vector obtained by mapping \(\mathbf{x}\); the model corresponding to a separating hyperplane in feature space can then be written as:
\[
f(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b
\]
Following the same derivation as before, the dual problem becomes:
\[
\max_{\mathbf{\alpha}} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j) \\
s.t. \ \sum_{i=1}^m \alpha_i y_i = 0, \\
\alpha_i \geq 0
\]
Since the feature space may be very high dimensional, possibly even infinite dimensional, computing \(\phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)\) directly is usually difficult. To avoid this, one can posit a function
\[
\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)
\]
so that
\[
\max_{\mathbf{\alpha}} \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \kappa(\mathbf{x}_i, \mathbf{x}_j) \\
s.t. \ \sum_{i=1}^m \alpha_i y_i = 0, \\
\alpha_i \geq 0
\]
The \(\alpha_i\) can again be solved with the SMO algorithm and substituted into:
\[
f(\mathbf{x}) = \mathbf{w}^T \phi(\mathbf{x}) + b \\
= \sum_{i=1}^m \alpha_i y_i \phi(\mathbf{x}_i)^T \phi(\mathbf{x}) + b \\
= \sum_{i=1}^m \alpha_i y_i \kappa(\mathbf{x}, \mathbf{x}_i) + b.
\]
Many kernel functions are in common use; this article uses the radial basis function (Gaussian) kernel:
\[
\kappa(\mathbf{x}, \mathbf{x}_i) = \exp\left( \frac{-\|\mathbf{x} - \mathbf{x}_i\|^2}{2\sigma^2} \right)
\]
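As a small numerical check of the kernel trick, the sketch below compares an explicit degree-2 polynomial feature map \(\phi\) with the corresponding kernel \(\kappa(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2\) for 2-dimensional inputs; the mapping and the vectors used here are illustrative only and are independent of the RBF kernel used in the code below.

import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (x^T z)^2 in 2-D:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)."""
    return np.array([x[0]**2, x[1]**2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = phi(x) @ phi(z)   # inner product in the mapped feature space
kernel = (x @ z) ** 2        # same value computed without ever forming phi
print(explicit, kernel)      # both print 1.0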

def kernelTrans(X, A, kTup):
    """
    Compute the kernel values.
    Inputs:
        X: (m, n)
        A: (1, n)
        kTup: a tuple; the first element is the kernel type (only linear and Gaussian
              are supported here), the second is the width parameter \sigma
    Output:
        K: (m, 1)
    """
    m, n = shape(X)
    K = mat(zeros((m, 1)))
    if kTup[0] == 'lin': K = X * A.T
    elif kTup[0] == 'rbf':
        for j in range(m):
            deltaRow = X[j, :] - A          # (1, n)
            K[j] = deltaRow * deltaRow.T    # (1, 1)
        K = exp(K / (-1 * kTup[1]**2))      # (m, 1)
    else: raise NameError('That kernel is not recognized')
    return K
class optStruct:
    """
    A helper structure that holds the parameters used by the SMO algorithm.
    """
    def __init__(self, dataMatIn, classLabels, C, toler, kTup):
        self.X = dataMatIn
        self.labelMat = classLabels
        self.C = C
        self.tol = toler
        self.m = shape(dataMatIn)[0]
        self.alphas = mat(zeros((self.m, 1)))
        self.b = 0
        self.eCache = mat(zeros((self.m, 2)))  # error cache: first column is a validity flag, second column is the actual E value
        # new member: the precomputed kernel matrix
        self.K = mat(zeros((self.m, self.m)))
        for i in range(self.m):
            self.K[:, i] = kernelTrans(self.X, self.X[i, :], kTup)
def innerL(i, oS):
    """
    Update one pair of alphas and b.
    The kernel version changes only two parts:
    1. the computation of eta
    2. the computation of b1 and b2
    Everything else is identical to the original version.
    """
    Ei = calcEk(oS, i)
    if ((oS.labelMat[i] * Ei < -oS.tol) and (oS.alphas[i] < oS.C)) or \
       ((oS.labelMat[i] * Ei > oS.tol) and (oS.alphas[i] > 0)):
        j, Ej = selectJ(i, oS, Ei)
        alphaIold = oS.alphas[i].copy(); alphaJold = oS.alphas[j].copy()
        # set the bounds of \alpha_j
        if (oS.labelMat[i] != oS.labelMat[j]):
            L = max(0, oS.alphas[j] - oS.alphas[i])
            H = min(oS.C, oS.C + oS.alphas[j] - oS.alphas[i])
        else:
            L = max(0, oS.alphas[j] + oS.alphas[i] - oS.C)
            H = min(oS.C, oS.alphas[j] + oS.alphas[i])
        if L == H:
            #print("L==H")
            return 0

        # compute \eta
        eta = 2.0 * oS.K[i, j] - oS.K[i, i] - oS.K[j, j]  # modified part
        if eta >= 0:
            #print("eta>=0")
            return 0

        # update \alpha_j
        oS.alphas[j] -= oS.labelMat[j] * (Ei - Ej) / eta
        # clip \alpha_j
        oS.alphas[j] = clipAlpha(oS.alphas[j], H, L)
        updateEk(oS, j)  # keep the error cache consistent after changing \alpha_j
        if (abs(oS.alphas[j] - alphaJold) < 0.00001):
            #print("j not moving enough")
            return 0

        # update \alpha_i
        oS.alphas[i] += oS.labelMat[j] * oS.labelMat[i] * (alphaJold - oS.alphas[j])

        updateEk(oS, i)

        # update b (also modified for the kernel version)
        b1 = oS.b - Ei - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * oS.K[i, i] - \
             oS.labelMat[j] * (oS.alphas[j] - alphaJold) * oS.K[i, j]
        b2 = oS.b - Ej - oS.labelMat[i] * (oS.alphas[i] - alphaIold) * oS.K[i, j] - \
             oS.labelMat[j] * (oS.alphas[j] - alphaJold) * oS.K[j, j]

        # everything below is unchanged
        if (oS.alphas[i] > 0) and (oS.alphas[i] < oS.C):
            oS.b = b1
        elif (oS.alphas[j] > 0) and (oS.alphas[j] < oS.C):
            oS.b = b2
        else:
            oS.b = (b1 + b2) / 2.0
        return 1

    else:
        return 0

def calcEk(oS, k):
    """
    Compute the prediction error for sample k.
    """
    fXk = float(multiply(oS.alphas, oS.labelMat).T * oS.K[:, k] + oS.b)
    Ek = fXk - float(oS.labelMat[k])
    return Ek
def testRbf(k1=1.3):
    dataArr, labelArr = loadDataSet('testSetRBF.txt')
    # train the model using the training dataset
    b, alphas = smoP(dataArr, labelArr, 200, 0.0001, 10000, ('rbf', k1))
    dataMat = mat(dataArr); labelMat = mat(labelArr).transpose()
    svInd = nonzero(alphas.A > 0)[0]
    sVs = dataMat[svInd]
    labelSv = labelMat[svInd]
    print("there are %d support vectors" % shape(sVs)[0])
    m, n = shape(dataMat)
    errorCount = 0
    for i in range(m):
        kernelEval = kernelTrans(sVs, dataMat[i, :], ('rbf', k1))
        predict = kernelEval.T * multiply(labelSv, alphas[svInd]) + b
        if sign(predict) != sign(labelArr[i]): errorCount += 1
    print("the training error rate is: %f" % (float(errorCount)/m))

    # Test process
    dataArr, labelArr = loadDataSet('testSetRBF2.txt')
    errorCount = 0
    dataMat = mat(dataArr); labelMat = mat(labelArr).transpose()
    m, n = shape(dataMat)
    for i in range(m):
        kernelEval = kernelTrans(sVs, dataMat[i, :], ('rbf', k1))
        predict = kernelEval.T * multiply(labelSv, alphas[svInd]) + b
        if sign(predict) != sign(labelArr[i]): errorCount += 1
    print("the test error rate is: %f" % (float(errorCount)/m))
def smoP(dataMatIn, classLabels, C, toler, maxIter, kTup=('lin', 0)):
    oS = optStruct(mat(dataMatIn), mat(classLabels).transpose(), C, toler, kTup)
    iter = 0
    entireSet = True; alphaPairsChanged = 0
    while (iter < maxIter) and ((alphaPairsChanged > 0) or (entireSet)):
        alphaPairsChanged = 0
        if entireSet:
            for i in range(oS.m):
                alphaPairsChanged += innerL(i, oS)
                #print("fullSet, iter: %d i: %d, pairs changed %d" %\
                #      (iter, i, alphaPairsChanged))
            iter += 1
        else:
            nonBoundIs = nonzero((oS.alphas.A > 0) * (oS.alphas.A < C))[0]
            for i in nonBoundIs:
                alphaPairsChanged += innerL(i, oS)
                #print("non-bound, iter: %d i: %d, pairs changed %d" % \
                #      (iter, i, alphaPairsChanged))
            iter += 1
        if entireSet: entireSet = False
        elif (alphaPairsChanged == 0): entireSet = True
        #print("iteration number: %d" % iter)
    return oS.b, oS.alphas
testRbf()
there are 17 support vectors
the training error rate is: 0.000000
the test error rate is: 0.050000

Example: Handwritten Digit Recognition

def img2vector(filename):
    """
    Convert a 32*32 image file into a 1*1024 vector.
    """
    returnVec = zeros((1, 1024))
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()  # read one line of the file
        for j in range(32):
            returnVec[0, 32*i+j] = int(lineStr[j])
    return returnVec
def loadImages(dirName):
    from os import listdir
    hwLabels = []
    trainingFileList = listdir(dirName)
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]
        classNumStr = int(fileStr.split('_')[0])
        if classNumStr == 9: hwLabels.append(-1)
        else: hwLabels.append(1)
        trainingMat[i, :] = img2vector('%s/%s' % (dirName, fileNameStr))

    return trainingMat, hwLabels

def testDigits(kTup=('rbf', 10)):
    dataArr, labelArr = loadImages('trainingDigits/')
    b, alphas = smoP(dataArr, labelArr, 200, 0.0001, 100000, kTup)
    dataMat = mat(dataArr); labelMat = mat(labelArr).transpose()
    svInd = nonzero(alphas.A > 0)[0]
    sVs = dataMat[svInd]
    labelSv = labelMat[svInd]
    print("there are %d support vectors" % shape(sVs)[0])
    m, n = shape(dataMat)
    errorCount = 0
    for i in range(m):
        kernelEval = kernelTrans(sVs, dataMat[i, :], kTup)
        predict = kernelEval.T * multiply(labelSv, alphas[svInd]) + b
        if sign(predict) != sign(labelArr[i]): errorCount += 1
    print("the training error rate is: %f" % (float(errorCount)/m))

    # Test process
    dataArr, labelArr = loadImages('testDigits/')
    errorCount = 0
    dataMat = mat(dataArr); labelMat = mat(labelArr).transpose()
    m, n = shape(dataMat)
    for i in range(m):
        kernelEval = kernelTrans(sVs, dataMat[i, :], kTup)
        predict = kernelEval.T * multiply(labelSv, alphas[svInd]) + b
        if sign(predict) != sign(labelArr[i]): errorCount += 1
    print("the test error rate is: %f" % (float(errorCount)/m))
testDigits(('lin', 100))
there are 23 support vectors
the training error rate is: 0.044776
the test error rate is: 0.043011
The training and test error rates and the number of support vectors for different kernels and \(\sigma\) values are summarized below:

| kernel, \(\sigma\) | train error (%) | test error (%) | # support vectors |
| --- | --- | --- | --- |
| RBF, 1 | 0 | 52.2 | 402 |
| RBF, 10 | 0 | 0.5 | 86 |
| RBF, 50 | 0 | 1.07 | 30 |
| RBF, 100 | 3.2 | 4.3 | 31 |
| Linear | 4.5 | 4.3 | 31 |

Conclusions and Discussion

Conclusions

  • Advantages: low generalization error, modest computational cost, and results that are easy to interpret
  • Disadvantages: sensitive to parameter tuning and to the choice of kernel

Discussion

  1. This article discusses a simplified version of the SMO algorithm; several further optimizations are not covered here, and interested readers are referred to [3].
  2. Only binary classification is discussed here; the method can be extended to multi-class problems using the "one-vs-all" strategy.
  3. The same idea can also be applied to regression problems, where the method is called Support Vector Regression (SVR). The idea is to tolerate a deviation of at most \(\epsilon\) between the predicted and true values, counting a loss only when the error exceeds \(\epsilon\). The \(\epsilon\)-insensitive loss is defined as (a small sketch of this loss follows the list): \[ L_{\epsilon}(z) = \begin{cases} 0, & \text{if } |z| \leq \epsilon; \\ |z| - \epsilon, & \text{otherwise.} \end{cases} \]
  4. SVMs can also be extended to unsupervised learning.
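As referenced in item 3 above, here is a minimal NumPy sketch of the \(\epsilon\)-insensitive loss; the function name `eps_insensitive_loss` is purely illustrative and is not part of the code used earlier.

import numpy as np

def eps_insensitive_loss(z, eps):
    """epsilon-insensitive loss: 0 inside the eps-tube, |z| - eps outside."""
    return np.maximum(0.0, np.abs(z) - eps)

# Example: errors within the tube cost nothing, larger errors cost linearly
print(eps_insensitive_loss(np.array([-0.05, 0.1, 0.5]), eps=0.1))  # [0.  0.  0.4]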

References

  1. Peter Harrington. Machine Learning in Action. Manning Publications, 2012.
  2. Zhou Zhihua. Machine Learning. Tsinghua University Press, 2016.
  3. Keerthi et al. Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, 2001.
  4. Sequential Minimal Optimization for SVM.